# The Multiple Comparisons Problem with Multinomial Tests

There’s a multiple comparisons problem with multinomial tests, and it has a simple solution. In this post, I demonstrate what the problem is and how to overcome it.

## Multinomial Tests

Multinomial tests help in comparing discrete distributions. For example,
one might use a multinomial test to determine whether two subpopulations
look the same with respect to demographic characteristics. Supose we
have a dataset with individuals divided into two groups. In group *A* we
have individuals who voted in a US mid-term election. In group *B* we
have those who did not. For simplicity’s sake, assume the only
demographic variables we have for individuals in each group are gender
(*F* = 1 if female, 0 if male) and whether an individual has at least a
four-year degree (*D* = 1 if yes, 0 otherwise).

For each of the groups, there are four unique strata into which individuals can fall:

- Female and a four-year degree (
*F*= 1 and*D*= 1); - Female and no degree (
*F*= 1 and*D*= 0); - Male and a four-year degree (
*F*= 0 and*D*= 1); - Male and no degree (
*F*= 1 and*D*= 0).

The frequency distributions of each strata differ between groups. As the
below figure summarizes, women and those with a four-year degree are
over-represented among voters (group *A*) and are under-represented
among non-voters (group *B*).

The groups clearly look different, but is this difference statistically detectable? We can answer this question using a multinomial test, such as chi-squared:

```
library(tidyverse) # for grammar
library(seerrr) # for simulations
library(XNomial) # for multinomial tests
# perform chi-squared test
test <- xmonte(
obs = dt$freq[dt$group == "A"],
exp = dt$freq[dt$group == "B"],
statName = "Chisq"
)
```

```
##
## P value (Chisq) = 0 ± 0
```

```
test[c("observedChi", "pChi")] %>% bind_cols()
```

```
## # A tibble: 1 x 2
## observedChi pChi
## <dbl> <dbl>
## 1 644. 0
```

The output shows the computed chi-squared statistic with its Monte Carlo simulated p-value (this is a more robust alternative to a test based on the asymtotic chi-squared distribution). The p-value is well below the usual 0.05 threshhold, meaning we can reject the null that the subpulations are drawn from the same multinomial distribution.

## The Problem

The convenience of a multinomial test in this context is its relatively
few assumptions. It simply considers whether the frequencies in each
strata between groups *A* and *B* are too different from one another to
be the result of chance. Beyond this, it makes no assumptions about the
form or direction of this difference. This is a good justification for
using this kind of test for addressing questions like that posed above.

However, an often overlooked limitation of this approach is that
multinomial tests, though not explicitly specified as such, are based on
multiple comparisons. Unlike a t-test, for example, which considers the
likelihood that the difference in means of two groups is greater than
we’d expect by chance, a multinomial test like chi-squared considers
differences across *multiple* strata between groups. In the case of the
above example, four comparisons are being made on the basis of voting
status—not just one.

This is problematic from a testing perspective because multiple comparisons are subject to what’s known as the multiple testing problem. In short, as the number of statistical tests being considered grows, the likelihood of falsely rejecting the null hypothesis increases. This is what’s called the type I error or false positive rate.

That multinomial tests are subject to this problem isn’t obvious at first blush because they involve the estimation of a single statistic to summarize the differences between groups, and this statistic has one unique p-value. But the problem exists nonetheless. The below simulation illustrates this point.

When the null hypothesis is true, we expect to get a false positive 5
percent of the time. In the below code a possible distribution of
p-values under the null is simulated using `runif`

and is saved as the
object `p.null`

. The code then iteratively simulates two multinomial
distributions that have the same underlying data-generating process and
applies the chi-squared test. The calulated p-values for each run of the
simulation are contained in the object `sim_out`

.

```
# what the null distribution should look like
p.null <- runif(200)
# what the actual null distribution looks like
quiet <- function(x) {
sink(tempfile())
on.exit(sink())
invisible(force(x))
}
simulate(
N = 4,
a_freq = as.vector(rmultinom(1, 100, rep(0.25, len = N))),
b_freq = as.vector(rmultinom(1, 100, rep(0.25, len = N)))
) %>%
map_dfr(
~ quiet(xmonte(.$a_freq, .$b_freq,
statName = "Chisq"))[["pChi"]] %>%
tibble(p.value = .)
) -> sim_out
```

We would expect that, since the null is true (that the two groups being
compared are drawn from the same multinomial distribution), the
chi-squared test should reject the null about 5 percent of the time. But
when we plot the distribution of `p.null`

next to the distribution of
chi-squared p-values we clearly see that we reject the null far more
frequently than expected.

```
sim_out %>%
mutate(
p.bins = round(p.value, 2),
sig = ifelse(
p.bins <= 0.05,
"p < 0.05",
"p >= 0.05"
)
) %>%
group_by(sig, p.bins) %>%
count() %>%
ungroup %>%
mutate(n = n / sum(n)) -> p.data.sim
ggplot(p.data.sim) +
aes(
x = p.bins,
y = n,
fill = sig
) +
geom_col(color = "black", position = "identity") +
geom_vline(
xintercept = 0.05,
lty = 2,
color = "darkred"
) +
labs(
x = "p-values",
y = "Proportion",
subtitle = "chi-squared null distribution of p-values"
) +
theme(
legend.position = "none"
) -> p1
tibble(
p.null = p.null,
p.bins = round(p.null, 2),
sig = ifelse(p.bins <= 0.05, "yes", "no")
) %>%
group_by(p.bins, sig) %>%
count %>%
ungroup %>%
mutate(
n = n / sum(n)
) %>%
ggplot() +
aes(
x = p.bins,
y = n,
fill = sig
) +
geom_col(color = "black", position = "identity") +
geom_vline(
xintercept = 0.05,
lty = 2,
color = "darkred"
) +
labs(
x = "p-values",
y = "Proportion",
subtitle = "What the null distribution should look like"
) +
ylim(c(0, max(p.data.sim$n))) +
theme(
legend.position = "none"
) -> p2
p1 + p2
```

The left panel in the above figure shows the distribution of chi-squared p-values, and the right panel shows what the distribution of p-values should be under the null. The difference is stark; yet, this is the very distribution of p-values we would expect when making multiple comparisons.

The below script helps to illustrate the point. It generates p-value
distributions under the null for an increasing number of tests being
considered simultaneously (1 to 5). It then shows the proportion of
times the null is rejected at the *p* ≤ 0.05 level for at least one of
the tests:

```
# simulate 1 to 5 tests
tests <- 1:5
map_dfc(
tests,
~ tibble(p.value = runif(2000))
) -> out
# get the minimum p value for 1 to 5 tests
mins <- list()
for(i in tests) mins[[i]] <- lapply(1:nrow(out), function(x) {
out[x, 1:i] %>% t() %>% min()
}) %>% do.call(c, .)
# summarize the rejection rate per no. of tests
bind_cols(mins) %>%
summarize(
across(everything(), ~ mean(.x <= 0.05))
) %>%
pivot_longer(
1:5
) %>%
mutate(
tests = paste0(1:5, ifelse(1:5 == 1, " test", " tests"))
) %>%
ggplot() +
aes(
x = tests,
y = value
) +
geom_col(width = 0.5, color = "black") +
labs(
x = NULL,
y = "False-Positive Rate",
title = "Worsening false-positive rate"
) +
scale_y_continuous(
labels = scales::percent
) +
geom_hline(
yintercept = 0.05
)
```

Clearly, as the number of tests being considered increases, the likelihood that we incorrectly reject the null for at least one test increases as well.

## A Solution

This problem has a simple solution: adjust the significance level of the
multinomial test so that its false positive rate returns to 5 percent.
We usually denote the level of a test as *α* and set this to *α* = 0.05.
Denote *α*′ as the adjusted alpha level that preserves a 5 percent false
positive rate.

To recover *α*′ for a given desired *α* we can take a simulation
approach. This entails generating a Monte Carlo distribution of p-values
under the null and then setting *α*′ as the 5th percentile this
distribution.

Consider the distribution of chi-squared p-values under the null generated earlier:

To identify *α*′ all we need to do is calculate the 5th percentile of
this distribution. We can do this by writing:

```
alpha <- 0.05
alpha_prime <- as.vector(
quantile(sim_out$p.value, alpha)
)
alpha_prime # new alpha level
```

```
## [1] 0.0003785
```

Using this new level, we can recover a test with a 5 percent false positive rate:

## Putting it all together

In summary, multinomial tests have a multiple comparisons problem. This fact is easy to miss because multinomial tests, like chi-squared, involve the estimation of a single test statitic and related p-value. However, this single statistic reflects a summary of multiple comparisons within several strata in the data. As more strata are being compared, the likelihood of incorrectly rejecting the null increases.

This problem has a simple solution. By simulating a distribution of
p-values under the null via a Monte Carlo analysis, it is possible
identify a more appropriate test level for rejecting the null. The above
discussion described what this entails. For convenience, I’ve included
some code below that can be easily addapted to a variety of scenarios
for running such a Monte Carlo analysis. It only requires using the
`XNomial`

package to run.

```
# helper to keep xmonte from producing message
# (this is really anoying when running multple
# iterations)
quiet <- function(x) {
sink(tempfile())
on.exit(sink())
invisible(force(x))
}
# function to identify new test level
find_new_alpha <- function(
desired_alpha = 0.05, # the desired/original test level
group1_freq, # the frequency distribution for group 1
group2_freq, # the frequency distribution for group 2
sims = 1000, # no. of iterations to run
statName = "Chisq" # test to have xmonte estimate
) {
# expected probabilities
p <- group2_freq / sum(group2_freq)
# simulate null
sim.data <- lapply(
1:sims,
function(x) data.frame(
group1 = as.vector(
rmultinom(1, sum(group1_freq), p)
),
group2 = as.vector(
rmultinom(1, sum(group2_freq), p)
)
)
)
# run test
cat("\nSimulating p-values (be patient).......\n")
sim.p <- lapply(
sim.data,
function(data) quiet(XNomial::xmonte(
data$group1, data$group2, statName = statName
))[[5]]
)
cat("\nFinished!\n")
sim.p <- do.call(c, sim.p)
# compute new level
new_alpha <- as.vector(
quantile(sim.p, desired_alpha)
)
names(new_alpha) <- "reject the null if p <= to:"
# return
return(new_alpha)
}
```

To see how it works, we can use it on the simulated voting turnout data described at the beginning of this post (note that it will take some time to run):

```
new_alpha <- find_new_alpha(
group1_freq = dt$freq[dt$group=="A"],
group2_freq = dt$freq[dt$group=="B"]
)
```

```
##
## Simulating p-values (be patient).......
##
## Finished!
```

```
new_alpha # print the new alpha-level
```

```
## reject the null if p <= to:
## 0.007809
```

The above gives us our new threshold for rejecting the null. If a
different original test level is desired, simply override the default
for `desired_alpha`

.