library("segregation")
Eliminating the bias of segregation indices
It is well known that most standard estimators of segregation indices are biased. The segregation package provides a few tools to assess this bias. This post will discuss this problem with some simple examples and show under what conditions bootstrapping and simulation can help to remove the bias. The post relies on some tools that were only recently added to the package, so install the most recent version to follow along.
Bias in small and large samples
To illustrate the problem, let’s use R’s stats::r2dtable function to simulate a random contingency table. To make the following more concrete, let’s assume that we observe racial segregation in schools. Each school has an equal number of students of each of the two racial groups, but we only observe a sample. If the sample is small, we do not expect to draw exactly the same number of students from each group in every school, so the segregation index is likely to be biased upwards.
One hypothetical sample could look like this:
(mat = stats::r2dtable(1, rep(10, 5), c(25, 25))[[1]])
#> [,1] [,2]
#> [1,] 5 5
#> [2,] 4 6
#> [3,] 3 7
#> [4,] 7 3
#> [5,] 6 4
Now we can compute the Mutual Information index (M) and its normalized version, the H index:
library("segregation")
dat = matrix_to_long(mat) # convert to long format
mutual_total(dat, "group", "unit", weight = "n")
#> stat est
#> <char> <num>
#> 1: M 0.0410
#> 2: H 0.0591
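To see where these two numbers come from, here is a minimal base-R sketch that recomputes both indices directly from the table shown above. It assumes natural logarithms and that H is M divided by the entropy of the group distribution; the variable names are my own and not part of the package.

```r
# Table from the example above: rows are schools, columns are the two groups
tab <- matrix(c(5, 4, 3, 7, 6,
                5, 6, 7, 3, 4), ncol = 2)

p       <- tab / sum(tab)   # joint distribution of school and group
p_unit  <- rowSums(p)       # marginal distribution of schools
p_group <- colSums(p)       # marginal distribution of groups

# M: mutual information between school and group (no zero cells here,
# so no special handling of 0 * log(0) is needed)
m <- sum(p * log(p / outer(p_unit, p_group)))

# H: M normalized by the entropy of the group distribution
entropy_group <- -sum(p_group * log(p_group))
h <- m / entropy_group

round(c(M = m, H = h), 4)
```

Rounded to four digits this gives 0.0410 for M and 0.0591 for H, the same values as in the package output.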
Clearly, both indices are non-zero. For the index of dissimilarity, the bias is even stronger:
dissimilarity(dat, "group", "unit", weight = "n")
#> stat est
#> <char> <num>
#> 1: D 0.24
An index value of 0.3 is often interpreted as “moderate segregation”, so a value of 0.24 that arises purely from sampling is clearly a problem. Generally, the index of dissimilarity suffers more from small-sample bias than the information-theoretic indices.
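The value of 0.24 is easy to verify by hand. The sketch below uses the usual two-group definition of the index of dissimilarity (half the sum of absolute differences between the two groups' distributions across schools) on the table shown above; the variable names are only for illustration.

```r
# Table from the example above: rows are schools, columns are the two groups
tab <- matrix(c(5, 4, 3, 7, 6,
                5, 6, 7, 3, 4), ncol = 2)

# Share of each group's students attending each school
share_group1 <- tab[, 1] / sum(tab[, 1])
share_group2 <- tab[, 2] / sum(tab[, 2])

# D: half the sum of absolute differences between the two distributions
d <- 0.5 * sum(abs(share_group1 - share_group2))
d
```

With column totals of 25, the absolute differences are 0, 2, 4, 4 and 2 students, so D = 0.5 * 12 / 25 = 0.24.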
Importantly, the bias is not simply a function of sample size. For instance, if we increase the number of schools to 10,000, but still expect 5 students of each racial group in each school, the bias is pretty much the same:
mat_large = stats::r2dtable(1, rep(10, 10000), c(50000, 50000))[[1]]
dat_large = matrix_to_long(mat_large) # convert to long format
mutual_total(dat_large, "group", "unit", weight = "n")
#> stat est
#> <char> <num>
#> 1: M 0.0540
#> 2: H 0.0778
dissimilarity(dat_large, "group", "unit", weight = "n")
#> stat est
#> <char> <num>
#> 1: D 0.248
This is despite the fact that in the first case, our sample size is 50, and in the second case it’s 100,000! For the index of dissimilarity, Winship (1977) has described this bias in detail.
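A rough way to see why this happens for the M index: under independence, 2N times M is the likelihood-ratio statistic of the table, so the expected value of M is approximately (U - 1)(G - 1) / (2N) for U schools, G groups and N students. Adding schools increases U and N at the same rate, whereas larger schools increase only N. The sketch below checks this by simulation. The helpers m_index, approx_bias and simulate_m are hypothetical names written for this illustration, and the formula is only a first-order approximation.

```r
# M index (natural log) for a schools-by-groups matrix of counts
m_index <- function(tab) {
  p <- tab / sum(tab)
  expected <- outer(rowSums(p), colSums(p))
  nz <- p > 0                          # treat 0 * log(0) as 0
  sum(p[nz] * log(p[nz] / expected[nz]))
}

# First-order approximation of E[M] under independence
approx_bias <- function(n_units, n_groups, n_total) {
  (n_units - 1) * (n_groups - 1) / (2 * n_total)
}

# Average M over random tables with two equally sized groups
simulate_m <- function(n_units, unit_size, reps) {
  n_total <- n_units * unit_size
  tabs <- stats::r2dtable(reps, rep(unit_size, n_units),
                          c(n_total / 2, n_total / 2))
  mean(vapply(tabs, m_index, numeric(1)))
}

set.seed(1)
few_schools    <- simulate_m(5, 10, reps = 200)     # 50 students
many_schools   <- simulate_m(1000, 10, reps = 20)   # 10,000 students
bigger_schools <- simulate_m(1000, 100, reps = 20)  # 100,000 students

rbind(
  simulated   = c(few = few_schools, many = many_schools,
                  bigger = bigger_schools),
  approximate = c(approx_bias(5, 2, 50), approx_bias(1000, 2, 10000),
                  approx_bias(1000, 2, 100000))
)
```

In this sketch, adding schools of the same size leaves the average M roughly where it was, while making each school ten times larger shrinks it by about a factor of ten.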
Solution 1: Bootstrapping
In many circumstances, it helps to enable bootstrapping to estimate the bias. When bootstrapping is enabled, the segregation package reports bias-adjusted estimates. Let’s try this for both datasets from above:
mutual_total(dat, "group", "unit", weight = "n", se = TRUE)
#> 100 bootstrap iterations on 50 observations
#> stat est se CI bias
#> <char> <num> <num> <list> <num>
#> 1: M -0.00933 0.0465 -0.0956, 0.0647 0.0503
#> 2: H -0.01520 0.0683 -0.1400, 0.0932 0.0743
mutual_total(dat_large, "group", "unit", weight = "n", se = TRUE)
#> 100 bootstrap iterations on 1e+05 observations
#> stat est se CI bias
#> <char> <num> <num> <list> <num>
#> 1: M -0.000629 0.00118 -0.00296, 0.00159 0.0546
#> 2: H -0.000908 0.00170 -0.00427, 0.00230 0.0788
In this case, the bootstrap estimates the bias pretty well. Because the bias (last column) is subtracted from the segregation estimates, the bootstrap-adjusted estimate may become slightly negative.
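The logic behind the adjustment can be sketched in a few lines of base R. This is the textbook bootstrap bias correction applied to the small table from above, not necessarily the exact procedure the package uses internally: resample the students with replacement, recompute M each time, take the difference between the bootstrap average and the original estimate as the bias, and subtract it. The helper m_index and all variable names are my own.

```r
# M index (natural log) for a schools-by-groups matrix of counts
m_index <- function(tab) {
  p <- tab / sum(tab)
  expected <- outer(rowSums(p), colSums(p))
  nz <- p > 0                          # treat 0 * log(0) as 0
  sum(p[nz] * log(p[nz] / expected[nz]))
}

# Table from the small example: rows are schools, columns are groups
tab <- matrix(c(5, 4, 3, 7, 6,
                5, 6, 7, 3, 4), ncol = 2)
m_hat <- m_index(tab)

set.seed(42)
n_boot <- 500
boot_m <- replicate(n_boot, {
  # Resampling students with replacement is a multinomial draw over cells
  counts <- stats::rmultinom(1, size = sum(tab), prob = as.vector(tab))
  m_index(matrix(counts, nrow = nrow(tab)))
})

bias_hat    <- mean(boot_m) - m_hat  # estimated bias
m_corrected <- m_hat - bias_hat      # bias-adjusted estimate

c(estimate = m_hat, bias = bias_hat, corrected = m_corrected)
```

Because the estimated bias is of the same size as the estimate itself, the corrected value ends up close to zero and can dip below it.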
For the index of dissimilarity, this procedure does not work as well:
dissimilarity(dat, "group", "unit", weight = "n", se = TRUE)
#> 100 bootstrap iterations on 50 observations
#> stat est se CI bias
#> <char> <num> <num> <list> <num>
#> 1: D 0.171 0.0999 -0.0632, 0.3596 0.0689
dissimilarity(dat_large, "group", "unit", weight = "n", se = TRUE)
#> 100 bootstrap iterations on 1e+05 observations
#> stat est se CI bias
#> <char> <num> <num> <list> <num>
#> 1: D 0.14 0.00201 0.137,0.145 0.108
Although the bias estimate is fairly large, a substantial bias remains.
Solution 2: Compute the expected value under independence
The bootstrap may sometimes work to estimate the bias, but two major problems remain. The first, as we have seen, is that the bias estimation does not work well for the index of dissimilarity. The second is that the bootstrap does badly when the contingency table is very sparse and contains many zero entries. I’ll come back to that in the example at the end of the post.
A more direct approach to estimating the bias is the following: Using the observed marginal distributions, simulate a contingency table under the assumption that true segregation is zero, and compute the segregation index for it. Repeat this process a number of times and record the average. This quantity is the expected value of the segregation index when students are randomly distributed across schools, conditional on the marginal distributions. In economics, this quantity is also sometimes called “random segregation” (Carrington and Troske 1998).
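The procedure just described can be sketched in base R for the small table from above. This is an illustration of the idea under the assumption that both margins are held fixed (which is what stats::r2dtable does), not a copy of the package's implementation; m_index and d_index are hypothetical helpers.

```r
# M index (natural log) for a schools-by-groups matrix of counts
m_index <- function(tab) {
  p <- tab / sum(tab)
  expected <- outer(rowSums(p), colSums(p))
  nz <- p > 0                          # treat 0 * log(0) as 0
  sum(p[nz] * log(p[nz] / expected[nz]))
}

# Two-group index of dissimilarity
d_index <- function(tab) {
  0.5 * sum(abs(tab[, 1] / sum(tab[, 1]) - tab[, 2] / sum(tab[, 2])))
}

# Observed table: rows are schools, columns are groups
tab <- matrix(c(5, 4, 3, 7, 6,
                5, 6, 7, 3, 4), ncol = 2)

set.seed(123)
n_sim <- 2000
# Random tables with the observed margins and no true segregation
sims <- stats::r2dtable(n_sim, rowSums(tab), colSums(tab))

expected_m <- mean(vapply(sims, m_index, numeric(1)))
expected_d <- mean(vapply(sims, d_index, numeric(1)))

c(M_under_0 = expected_m, D_under_0 = expected_d)
```

More simulated tables give a more precise average; 2000 is an arbitrary choice for this sketch.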
The segregation package implements this algorithm in the following two functions:
mutual_expected(dat, "group", "unit", weight = "n")
#> stat est se
#> <char> <num> <num>
#> 1: M under 0 0.0443 0.0290
#> 2: H under 0 0.0639 0.0418
dissimilarity_expected(dat, "group", "unit", weight = "n")
#> stat est se
#> <char> <num> <num>
#> 1: D under 0 0.226 0.0945
In both cases, calculating the expected value of the index gives a good estimate of the bias. When reporting the final results, we could simply subtract the bias from the segregation estimates.
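As a quick check of that subtraction, the snippet below plugs in the estimates printed earlier for the small dataset. The numbers are copied from the outputs above; only the variable names are new.

```r
# Naive estimates and expected values under independence, from the outputs above
m_observed <- 0.0410
m_expected <- 0.0443
d_observed <- 0.24
d_expected <- 0.226

# Bias-adjusted estimates
m_adjusted <- m_observed - m_expected
d_adjusted <- d_observed - d_expected

c(M = m_adjusted, D = d_adjusted)
```

Both adjusted values (about -0.003 for M and 0.014 for D) are close to zero, which is the true level of segregation in this simulated example.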
An example with sparse data
As a final point, the example in this section demonstrates circumstances under which the information-theoretic indices may also be highly biased.
The segregation package contains an example dataset, school_ses, with artificial data. Each row of this dataset describes a student, with information on the school the student attends (school_id), the student’s ethnic group (one of A, B, or C; ethnic_group), and the student’s socio-economic status (provided in quintiles; ses_quintile). Because there are three ethnic groups, we will only compute the multigroup M and H indices.
The school_ses dataset is sparse: There are 149 schools in total, but only 46 of those contain students of all three ethnic groups, and 26 schools contain only students of a single ethnic group.
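Counts like these are straightforward to obtain from a cross-tabulation. The sketch below shows one way to do it on a small, made-up data frame with the same column names; it is toy data for illustration, not the package's school_ses dataset.

```r
# Hypothetical toy data: one row per student
students <- data.frame(
  school_id    = c(1, 1, 1, 2, 2, 3, 3, 3, 4),
  ethnic_group = c("A", "B", "C", "A", "A", "B", "C", "C", "A")
)

# Schools in rows, ethnic groups in columns
tab <- table(students$school_id, students$ethnic_group)

# Number of distinct groups present in each school
groups_per_school <- rowSums(tab > 0)

n_all_groups   <- sum(groups_per_school == ncol(tab))  # schools with every group
n_single_group <- sum(groups_per_school == 1)          # schools with one group
share_zero     <- mean(tab == 0)                       # share of empty cells

c(all_groups = n_all_groups, single_group = n_single_group,
  share_zero = share_zero)
```

A high share of empty cells is a warning sign that the bias estimates discussed above deserve a closer look.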
The ethnic segregation in this dataset is fairly large, but we may expect this estimate to be upwardly biased:
mutual_total(school_ses, "ethnic_group", "school_id")
#> stat est
#> <char> <num>
#> 1: M 0.544
#> 2: H 0.577
For this dataset, the two approaches to estimating the bias differ somewhat:
mutual_total(school_ses, "ethnic_group", "school_id", se = TRUE)
#> 100 bootstrap iterations on 5153 observations
#> stat est se CI bias
#> <char> <num> <num> <list> <num>
#> 1: M 0.529 0.01000 0.512,0.545 0.0160
#> 2: H 0.559 0.00921 0.542,0.576 0.0181
mutual_expected(school_ses, "ethnic_group", "school_id")
#> stat est se
#> <char> <num> <num>
#> 1: M under 0 0.0304 0.00240
#> 2: H under 0 0.0322 0.00254
Using bootstrapping, the bias for the M index is estimated to be 0.016, while the bias estimated using the “random segregation” approach is 0.03.
This difference is still rather small, and will not be consequential in many situations. However, the advantage of using information-theoretic measures lies in their decomposability, and there the bias may be much larger. For instance, assume that we are interested in computing ethnic segregation conditional on SES. We can use the within argument to calculate this:
mutual_total(school_ses, "ethnic_group", "school_id", within = "ses_quintile")
#> stat est
#> <char> <num>
#> 1: M 0.463
#> 2: H 0.490
Estimating the bias of this conditional index using bootstrapping yields a bias estimate of around 0.04:
mutual_total(school_ses, "ethnic_group", "school_id", within = "ses_quintile",
se = TRUE)
#> 100 bootstrap iterations on 5153 observations
#> stat est se CI bias
#> <char> <num> <num> <list> <num>
#> 1: M 0.424 0.00853 0.410,0.439 0.0389
#> 2: H 0.450 0.00909 0.433,0.465 0.0408
However, if we compute the expected value conditional on SES, the result looks very different:
mutual_expected(school_ses, "ethnic_group", "school_id", within = "ses_quintile")
#> stat est se
#> <char> <num> <num>
#> 1: M under 0 0.105 0.00848
#> 2: H under 0 0.132 0.01113
The bias is estimated to be very large – around 0.1 for the M and around 0.13 for the H! The reason for this discrepancy is that the indices are computed within each group defined by the SES quintiles. These “conditional” contingency tables are much smaller, and even sparser than the overall dataset. It follows that the bias is even larger. One therefore has to be very careful when decomposing segregation measures for small or sparse samples.
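A back-of-the-envelope calculation makes the same point. Under independence, the expected value of M is approximately (U - 1)(G - 1) / (2N) for U schools, G groups and N students, and the within-M is the weighted average of the M values of the strata. The sketch below applies this first-order approximation to the dimensions of the dataset. The assumption that the five quintiles are equally sized and that all 149 schools and all three groups appear in every quintile is mine, made only for illustration, and approx_expected_m is a hypothetical helper.

```r
# First-order approximation of E[M] under independence
approx_expected_m <- function(n_units, n_groups, n_total) {
  (n_units - 1) * (n_groups - 1) / (2 * n_total)
}

n_students <- 5153
n_schools  <- 149
n_groups   <- 3
n_strata   <- 5   # SES quintiles

# Unconditional index: one large table
overall <- approx_expected_m(n_schools, n_groups, n_students)

# Conditional index: five tables, each with a fifth of the students
# (illustrative assumption: every school and group appears in every quintile)
weights    <- rep(1 / n_strata, n_strata)
per_strata <- approx_expected_m(n_schools, n_groups, n_students / n_strata)
within     <- sum(weights * per_strata)

round(c(overall = overall, within = within), 3)
```

The approximation gives about 0.029 for the overall index, close to the simulated 0.0304, and about 0.144 for the conditional index. The simulated value of 0.105 is lower, presumably because not every school has students in every quintile, but the direction and rough size of the effect are the same: splitting the data into five tables multiplies the number of estimated cells without adding any observations.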
Conclusion
When working with segregation indices, it is important to be aware that almost all “naive” estimators of these indices are upwardly biased. In many situations, this bias will be small. However, if the overall sample size is small, or some of the groups or units are small, the bias can be substantial. Importantly, it is not always the case that the bias is small in large samples. My recommendation is to always check the sensitivity of your results both by bootstrapping and by calculating “random segregation”. Special attention needs to be paid when decomposing segregation measures for small or sparse samples, as the decompositions will be based on even smaller/sparser samples.
References
Winship, Christopher. 1977. A Revaluation of Indexes of Residential Segregation. Social Forces 55(4): 1058-1066.
Carrington, William J. and Kenneth R. Troske. 1998. Interfirm Segregation and the Black/White Wage Gap. Journal of Labor Economics 16(2): 231-260.