Eliminating the bias of segregation indices

It is well known that most standard estimators of segregation indices are biased. The segregation package provides a few tools to assess this bias. This post will discuss this problem with some simple examples and show under what conditions bootstrapping and simulation can help to remove the bias. The post relies on some tools that were only recently added to the package, so install the most recent version from GitHub to follow along:

remotes::install_github("elbersb/segregation")

Bias in small and large samples

To illustrate the problem, let’s use R’s stats::r2dtable function to simulate a random contingency table with fixed margins. To make this concrete, assume that we study racial segregation in schools. In the population, each school has an equal number of students from each of two racial groups, so true segregation is zero, but we only observe a sample. If the sample is small, we should not expect to draw exactly equal numbers of students from the two groups in every school, so the estimated segregation index will be biased upwards.

One hypothetical sample could look like this:

(mat = stats::r2dtable(1, rep(10, 5), c(25, 25))[[1]])
##      [,1] [,2]
## [1,]    5    5
## [2,]    4    6
## [3,]    3    7
## [4,]    7    3
## [5,]    6    4

Now we can compute the Mutual Information index (M) and its normalized version, the H index:

library("segregation")
dat = matrix_to_long(mat) # convert to long format
mutual_total(dat, "group", "unit", weight = "n")
##      stat    est
##    <char>  <num>
## 1:      M 0.0410
## 2:      H 0.0591
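
To make the two indices concrete, here is a minimal base-R sketch that recomputes M and H for the table above directly from their definitions. The helper name mutual_info is hypothetical and not part of the segregation package, and H is assumed here to be M divided by the entropy of the group distribution, which reproduces the package output for this table.

```r
# The 5 x 2 table shown above (rows = schools, columns = racial groups)
mat_example <- matrix(c(5, 4, 3, 7, 6,
                        5, 6, 7, 3, 4), ncol = 2)

# Hypothetical helper: Mutual Information index of a contingency table
mutual_info <- function(tab) {
  p <- tab / sum(tab)                  # joint proportions
  e <- outer(rowSums(p), colSums(p))   # proportions expected under independence
  nz <- p > 0                          # empty cells contribute zero
  sum(p[nz] * log(p[nz] / e[nz]))
}

M <- mutual_info(mat_example)
p_group <- colSums(mat_example) / sum(mat_example)
H <- M / -sum(p_group * log(p_group))  # normalize by the entropy of the groups

round(c(M = M, H = H), 4)
```

Rounded to four digits, this gives 0.0410 for M and 0.0591 for H, the same values that mutual_total reports.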

Both indices are clearly non-zero, even though true segregation is zero. For the index of dissimilarity, the bias is even stronger:

dissimilarity(dat, "group", "unit", weight = "n")
##      stat   est
##    <char> <num>
## 1:      D  0.24
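
The index of dissimilarity is also easy to compute by hand for a two-group table: it is half the sum of the absolute differences between each school's share of group 1 and its share of group 2. The sketch below uses the same table; dissim is a hypothetical helper name, not a package function.

```r
# The 5 x 2 table shown above (rows = schools, columns = racial groups)
mat_example <- matrix(c(5, 4, 3, 7, 6,
                        5, 6, 7, 3, 4), ncol = 2)

# Hypothetical helper: index of dissimilarity for a two-group table
dissim <- function(tab) {
  shares <- sweep(tab, 2, colSums(tab), "/")  # each school's share of each group
  0.5 * sum(abs(shares[, 1] - shares[, 2]))
}

D <- dissim(mat_example)
D
```

The absolute differences in counts are 0, 2, 4, 4 and 2, so D = 0.5 × 12 / 25 = 0.24.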

An index value of 0.3 is often interpreted as “moderate segregation”, so this bias is clearly a problem. Generally, the index of dissimilarity suffers more from small-sample bias than the information-theoretic indices.

Importantly, the bias is not simply a function of sample size. For instance, if we increase the number of schools to 10,000, but still expect 5 students of each racial group in each school, the bias is pretty much the same:

mat_large = stats::r2dtable(1, rep(10, 10000), c(50000, 50000))[[1]]
dat_large = matrix_to_long(mat_large) # convert to long format
mutual_total(dat_large, "group", "unit", weight = "n")
##      stat    est
##    <char>  <num>
## 1:      M 0.0540
## 2:      H 0.0778

dissimilarity(dat_large, "group", "unit", weight = "n")
##      stat   est
##    <char> <num>
## 1:      D 0.248

This is despite the fact that in the first case, our sample size is 50, and in the second case it’s 100,000! For the index of dissimilarity, Winship (1977) has described this bias in detail.
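
What matters is the number of students per school, not the total sample size. A quick way to check this is to hold the number of schools fixed and vary only the school size. The base-R sketch below does that; mutual_info and simulate_m are hypothetical helpers, and the settings (1,000 schools with 10 or 100 students each) are arbitrary choices for illustration. As a rough guide, a standard large-sample approximation puts the expected value of M under independence at about (U - 1)(G - 1) / (2N) for U schools, G groups and N students, so adding schools of the same size leaves the bias essentially unchanged.

```r
set.seed(1)

# Hypothetical helper: Mutual Information index of a contingency table
mutual_info <- function(tab) {
  p <- tab / sum(tab)
  e <- outer(rowSums(p), colSums(p))
  nz <- p > 0
  sum(p[nz] * log(p[nz] / e[nz]))
}

# Simulate one table with no true segregation and return its M
simulate_m <- function(n_schools, school_size) {
  half <- n_schools * school_size / 2
  tab <- stats::r2dtable(1, rep(school_size, n_schools), c(half, half))[[1]]
  mutual_info(tab)
}

m_small_schools <- simulate_m(n_schools = 1000, school_size = 10)
m_large_schools <- simulate_m(n_schools = 1000, school_size = 100)

c(small = m_small_schools, large = m_large_schools)
```

With 100 students per school the estimate should be much closer to zero than with 10 students per school, even though the number of schools is the same.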

Solution 1: Bootstrapping

In many circumstances, it helps to enable bootstrapping to estimate the bias. When bootstrapping is enabled, the segregation package reports bias-adjusted estimates. Let’s try this for both datasets from above:

mutual_total(dat, "group", "unit", weight = "n", se = TRUE)
##      stat      est     se              CI   bias
##    <char>    <num>  <num>          <list>  <num>
## 1:      M -0.00933 0.0465 -0.0956, 0.0647 0.0503
## 2:      H -0.01520 0.0683 -0.1400, 0.0932 0.0743

mutual_total(dat_large, "group", "unit", weight = "n", se = TRUE)
##      stat       est      se                CI   bias
##    <char>     <num>   <num>            <list>  <num>
## 1:      M -0.000629 0.00118 -0.00296, 0.00159 0.0546
## 2:      H -0.000908 0.00170 -0.00427, 0.00230 0.0788

In this case, the bootstrap estimates the bias pretty well. Because the bias (last column) is subtracted from the segregation estimates, the bootstrap-adjusted estimate may become slightly negative.
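
The logic of the adjustment can be sketched in a few lines of base R. This is the textbook bootstrap bias correction applied to the small table from above, and it is not necessarily identical to what the package does internally; mutual_info is a hypothetical helper, and the 500 replications are an arbitrary choice.

```r
set.seed(1)

# The 5 x 2 table from the first example
mat_example <- matrix(c(5, 4, 3, 7, 6,
                        5, 6, 7, 3, 4), ncol = 2)

# Hypothetical helper: Mutual Information index of a contingency table
mutual_info <- function(tab) {
  p <- tab / sum(tab)
  e <- outer(rowSums(p), colSums(p))
  nz <- p > 0
  sum(p[nz] * log(p[nz] / e[nz]))
}

est <- mutual_info(mat_example)

# Resampling the 50 students with replacement is a multinomial draw over the cells
n_boot <- 500
boot <- replicate(n_boot, {
  counts <- stats::rmultinom(1, size = sum(mat_example),
                             prob = as.vector(mat_example))[, 1]
  mutual_info(matrix(counts, nrow = nrow(mat_example)))
})

bias <- mean(boot) - est   # bootstrap estimate of the bias
adjusted <- est - bias     # bias-adjusted estimate

c(est = est, bias = bias, adjusted = adjusted)
```

Because each resampled table adds another layer of sampling noise on top of the observed table, the bootstrap values are on average larger than the original estimate, and that difference serves as the bias estimate.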

For the index of dissimilarity, this procedure does not work as well:

dissimilarity(dat, "group", "unit", weight = "n", se = TRUE)
##      stat   est     se              CI   bias
##    <char> <num>  <num>          <list>  <num>
## 1:      D 0.171 0.0999 -0.0632, 0.3596 0.0689

dissimilarity(dat_large, "group", "unit", weight = "n", se = TRUE)
##      stat   est      se          CI  bias
##    <char> <num>   <num>      <list> <num>
## 1:      D  0.14 0.00201 0.137,0.145 0.108

Although the bias estimate is fairly large, a substantial bias remains: the adjusted estimates of 0.171 and 0.14 are still far from the true value of zero.

Solution 2: Compute the expected value under independence

The bootstrap may sometimes work to estimate the bias, but two major problems remain. The first, as we have seen, is that the bias estimation does not work well for the index of dissimilarity. The second is that the bootstrap does badly when the contingency table is very sparse and contains many zero entries. I’ll come back to that in the example at the end of the post.

A more direct approach to estimating the bias is the following: Using the observed marginal distributions, simulate a contingency table under the assumption that true segregation is zero, and compute the segregation index for it. Repeat this process a number of times and average the results. This quantity is the expected value of the segregation index when students are randomly distributed across schools, conditional on the marginal distributions. In economics, this quantity is also sometimes called “random segregation” (Carrington and Troske 1998).
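
A minimal base-R sketch of this procedure, using the small table from the first example, could look as follows. It relies on stats::r2dtable to draw tables with the observed margins; mutual_info and dissim are hypothetical helpers, and the 1,000 simulations are an arbitrary choice.

```r
set.seed(1)

# The 5 x 2 table from the first example
mat_example <- matrix(c(5, 4, 3, 7, 6,
                        5, 6, 7, 3, 4), ncol = 2)

# Hypothetical helper: Mutual Information index of a contingency table
mutual_info <- function(tab) {
  p <- tab / sum(tab)
  e <- outer(rowSums(p), colSums(p))
  nz <- p > 0
  sum(p[nz] * log(p[nz] / e[nz]))
}

# Hypothetical helper: index of dissimilarity for a two-group table
dissim <- function(tab) {
  shares <- sweep(tab, 2, colSums(tab), "/")
  0.5 * sum(abs(shares[, 1] - shares[, 2]))
}

# Simulate tables with the observed margins and no true segregation
sims <- stats::r2dtable(1000, rowSums(mat_example), colSums(mat_example))

m_null <- vapply(sims, mutual_info, numeric(1))
d_null <- vapply(sims, dissim, numeric(1))

# Averages across simulations estimate the expected value under independence
c(M_expected = mean(m_null), D_expected = mean(d_null))
```

The two averages estimate how much segregation we would observe purely by chance given these margins.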

The segregation package implements this algorithm in the following two functions:

mutual_expected(dat, "group", "unit", weight = "n")
##         stat    est     se
##       <char>  <num>  <num>
## 1: M under 0 0.0443 0.0290
## 2: H under 0 0.0639 0.0418

dissimilarity_expected(dat, "group", "unit", weight = "n")
##         stat   est     se
##       <char> <num>  <num>
## 1: D under 0 0.226 0.0945

In both cases, calculating the expected value of the index gives a good estimate of the bias. When reporting the final results, we could simply subtract the bias from the segregation estimates.

An example with sparse data

As a final point, the example in this section demonstrates some circumstances under which the information-theoretic indices may also be highly biased.

The segregation package contains an example dataset, school_ses, with artificial data. Each row of this dataset describes a student, with information on the school the student attends (school_id), the student’s ethnic group (one of A, B, or C; ethnic_group), and the student’s socio-economic status (provided in quintiles; ses_quintile). Because there are three ethnic groups, we will only compute multigroup indices using the M and H index.

The school_ses dataset is sparse: There are 149 schools in total, but only 46 of those contain students of all three ethnic groups, and 26 schools contain only students of a single ethnic group.

The ethnic segregation in this dataset is fairly large, but we may expect this estimate to be upwardly biased:

mutual_total(school_ses, "ethnic_group", "school_id")
##      stat   est
##    <char> <num>
## 1:      M 0.544
## 2:      H 0.577

For this dataset, the two approaches to estimating the bias differ somewhat:

mutual_total(school_ses, "ethnic_group", "school_id", se = TRUE)
##      stat   est      se          CI   bias
##    <char> <num>   <num>      <list>  <num>
## 1:      M 0.529 0.01000 0.512,0.545 0.0160
## 2:      H 0.559 0.00921 0.542,0.576 0.0181

mutual_expected(school_ses, "ethnic_group", "school_id")
##         stat    est      se
##       <char>  <num>   <num>
## 1: M under 0 0.0304 0.00240
## 2: H under 0 0.0322 0.00254

Using bootstrapping, the bias for the M index is estimated to be 0.016, while the bias estimated using the “random segregation” approach is 0.03.

This difference is still rather small, and will not be consequential in many situations. However, the advantage of using information-theoretic measures lies in their decomposability, and there the bias may be much larger. For instance, assume that we are interested in computing ethnic segregation conditional on SES. We can use the within argument to calculate this:

mutual_total(school_ses, "ethnic_group", "school_id", within = "ses_quintile")
##      stat   est
##    <char> <num>
## 1:      M 0.463
## 2:      H 0.490

Estimating the bias of this conditional index using bootstrapping yields a bias estimate of around 0.04:

mutual_total(school_ses, "ethnic_group", "school_id", within = "ses_quintile",
             se = TRUE)
##      stat   est      se          CI   bias
##    <char> <num>   <num>      <list>  <num>
## 1:      M 0.424 0.00853 0.410,0.439 0.0389
## 2:      H 0.450 0.00909 0.433,0.465 0.0408

However, if we compute the expected value conditional on SES, the result looks very different:

mutual_expected(school_ses, "ethnic_group", "school_id", within = "ses_quintile")
##         stat   est      se
##       <char> <num>   <num>
## 1: M under 0 0.105 0.00848
## 2: H under 0 0.132 0.01113

The bias is estimated to be very large – around 0.1 for the M and around 0.13 for the H! The reason for this discrepancy is that the indices are computed within each group defined by the SES quintiles. These “conditional” contingency tables are much smaller, and even sparser than the overall dataset. It follows that the bias is even larger. One therefore has to be very careful when decomposing segregation measures for small or sparse samples.
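
A rough back-of-the-envelope argument shows why conditioning inflates the bias. The notation below is my own and not taken from the package: K strata (here, SES quintiles), N_k students and U_k schools in stratum k, G ethnic groups, and N students overall. The within-M is a weighted average of the stratum-specific indices, and the second line uses the standard large-sample approximation for the expected value of M when there is no true segregation. That approximation itself becomes inaccurate for very sparse tables, so treat this as a sketch only.

```latex
\begin{align*}
M_{\text{within}} &= \sum_{k=1}^{K} p_k \, M_k, \qquad p_k = \frac{N_k}{N}, \\
\mathbb{E}\big[\hat{M}_k\big] &\approx \frac{(U_k - 1)(G - 1)}{2 N_k}
  \quad \text{(no true segregation within stratum } k\text{)}, \\
\mathbb{E}\big[\hat{M}_{\text{within}}\big] &\approx \frac{G - 1}{2N} \sum_{k=1}^{K} (U_k - 1).
\end{align*}
```

The unconditional counterpart is (U - 1)(G - 1) / (2N). If most schools have students in most strata, the sum over U_k - 1 is several times larger than U - 1, so the expected bias of the conditional index is several times larger as well.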

Conclusion

When working with segregation indices, it is important to be aware that almost all “naive” estimators of these indices are upwardly biased. In many situations, this bias will be small. However, if the overall sample size is small, or some of the groups or units are small, the bias can be substantial. Importantly, it is not always the case that the bias is small in large samples. My recommendation is to always check the sensitivity of your results both by bootstrapping and by calculating “random segregation”. Special attention needs to be paid when decomposing segregation measures for small or sparse samples, as the decompositions will be based on even smaller and sparser samples.

Winship, Christopher. 1977. A Revaluation of Indexes of Residential Segregation. Social Forces 55(4): 1058-1066.

Carrington, William J. and Kenneth R. Troske. 1998. Interfirm Segregation and the Black/White Wage Gap. Journal of Labor Economics 16(2): 231-260.