Logit and Heteroskedasticity
I know of few political scientists, economists, or other quantitative researchers who don’t use some form of robust standard errors when estimating linear regression models via OLS.
Why?
Because only under the most ideal and stringent of circumstances will classic OLS standard errors be efficient.[1]
Efficiency requires that the variance of the regression error is constant and that individual observations are independent of each other. In practice this usually never is the case. Instead, the errors may be heteroskedastic (having nonconstant variance) or may not all be independent (they may be clustered).
Let’s zero in on the problem of nonconstant variance.
When it comes to linear models, the solution to this problem is simple: just use a robust estimator for the standard errors (there are many to choose from). Such estimators take into account heterogeneity in the errors and thus provide more consistent standard errors and more appropriate statistical inferences.
But, for nonlinear models such as logit, the problem is a little more insidious, and, therefore, it requires doing more than simply using robust standard errors.
The Problem
To illustrate why the problem of heteroskedasticity has different implications between linear and nonlinear models, let’s take a quick look at a pair of regressions:

Y_{i} = β_{0} + β_{1}X_{i} + ϵ_{i}, ϵ_{i} ∼ 𝒩(0,σ_{i}^{2});

Pr (B_{i} = 1) = L[(β_{0}+β_{1}X_{i})/σ_{i}].
Equation 1 is a linear model and equation 2 is a logit model, where the function L( ⋅ ) is just the logistic function:
L(x) = 1/[1 + exp ( − x)].
A logit model can be equivalently expressed as
log [odds(B_{i} = 1)] = (β_{0} + β_{1}X_{i})/σ_{i}
These models differ in several ways. The first and most obvious difference is that equation 1 specifies a continuous response Y as a linear function of X. Meanwhile, equation 2 specifies the probability of a binary response B taking the value 1 as a logistic function of X.
But the differences go beyond these models’ functional forms. The difference that’s most material for our concern with heteroskedasticity is how the variance parameter σ_{i}^{2} enters each equation. In the linear model, the variance enters the model through an additive error term ϵ. But, in the logit model, the variance enters the model directly as the denominator of the linear additive input to the logistic function.
The reason for this difference lies in where the stochastic component of each model originates. In a linear model, the stochastic element enters additively, while in the logit model the logitistic curve itself is an input to a stochastic Bernoulli function.
The practical implication of this difference is that the variance component of the model isn’t part of the identity of the β parameters in the linear model but is part of the identity of the βs in the logit model.
This is a problem for classic logit estimation if the observationspecific variance is nonconstant.
The reason for this is that classic logit relies on a maximum likelihood estimator (MLE) to identify the β parameters where the σ is assumed constant at σ_{i} = σ = 1 ∀ i.
This assumption is only appropriate if each σ_{i} = σ. When this isn’t the case, the MLE estimates of the β’s won’t be correctly identified.[2]
The Common but Notquiteright Solution
To deal with heteroskedasticity in nonlinear models like logit, it’s common (at least in my field of political science) to see researchers use a robust variancecovariance estimator for standard errors but do nothing to correct the parameter estimates themselves.
This choice is understandable. Most statistical software packages will allow users to get something like the robust variancecovariance matrix for OLS for their MLE estimates. And, most software packages will also compute and report the related test statistics and pvalues.
However, just because our statistical software lets us do something, that doesn’t mean we should.
The supposedly robust standard errors that statistical packages produce are problematic for a number of reasons, the first being that they reflect the variance of a parameter that has not itself be estimated well. Even more, these robust standard errors are computed using the gradients of the very likelihood function that fails to capture the nonconstant variance in the data.
How is it possible to generate correct standard errors from the gradients of a misspecified likelihood function?
The answer: it isn’t.
The Better Solution (but not a silver bullet)
To deal with heteroskedasticity in nonlinear models like logit there exist a class of heteroskedastic estimators that allow for explicitly modeling the variance component of the model.
Such estimators are not a perfect hedge against bias to be sure. However, in certain instances it is possible to recover both less biased and more efficient parameter estimates by taking nonconstant variance into account.
Consider, for instance, a scenario where the probability of a binary outcome is given as
Pr (B_{i} = 1) = L[(X_{i})/exp(X_{i})].
It specifies that the probability that the binary response B_{i} = 1 is not only a function of X_{i}, but also that the variance changes as a function of X_{i}—exp (X_{i}) essentially replaces σ_{i} in the specification.
To recover estimates of the relationship between X_{i} and the response, we would specify a heteroskedastic logit model as follows:
log [odds(B_{i} = 1)] = (β_{0} + β_{1}X_{i})/exp (γ_{0} + γ_{1}X_{i}).
Compare this to the classic logit specification, which just assumes σ_{i} = 1 ∀ i:
log [odds(B_{i} = 1)] = β_{0} + β_{1}X_{i}.
If we were to evaluate the performance of the MLE estimates for these logit models, we would find that the heteroskedastic logit is less biased and that its standard errors provide the appropriate coverage for confidence intervals, supporting more informative statistical inferences.
A Simulation
To quickly show the advantage conferred by heteroskedastic logit, we can do a quick simulation study in R.
We first need to attach some statistical packages that we’ll need to run the analysis.
# Install if not already #
# devtools::install_github("milesdwilliams15/seerrr")
# Packages Needed # Reason #
# ================ # ========================= #
library(seerrr) # for simulation #
library(tidyverse) # for grammar #
library(glmx) # for heteroskedastic logit #
library(kableExtra)# for making nice tables #
Next, we’ll start by iteratively simulating a datagenerating process.
# Simulate a d.g.p. 1,000 times:
set.seed(101010101) # setting seed for replicability
L < function(x) 1 / (1 + exp(x)) # logistic fun.
simulate(
N = 1000,
id_var = 1:N,
X = rnorm(N),
s = exp(X),
Y = rbinom(N, 1, L(X / s))
) > sim_data
The above randomly generates a list of datasets of size N = 1, 000 drawn from a d.g.p. that includes (1) a predictor variable X that is a normal random variable with mean 0 and standard deviation of 1 and (2) a binary response variable Y that takes the value 1 with probability L(X/exp (X).
With multiple samples drawn from the d.g.p., we can now apply our choice
of estimators to model Y as a function X. Below, I specify a classic
logit model and a heteroskedastic logit model. (See
documentation
for hetglm
in the glmx
package for more on syntax and usage.)
# classic logit
logit < function(...) glm(..., family = binomial)
estimate(
sim_data,
Y ~ X,
vars = "X",
estimator = logit,
) > classic_logit
# heteroskedastic logit
het_logit < function(...) hetglm(..., family = binomial) %>%
lmtest::coeftest()
estimate(
sim_data,
Y ~ X  X,
vars = "X",
estimator = het_logit
) > hetero_logit
The above generates estimates that we get using logit and heteroskedastic logit. We can evaluate these estimates as follows:
# evaluate classic logit
evaluate(
classic_logit, what = "bias", truth = 1
) > classic_logit_eval
# evaluate heteroskedastic logit
evaluate(
hetero_logit, what = "bias", truth = 1
) > hetero_logit_eval
And we can report the results by writing:
# comparison
bind_rows(
classic_logit_eval,
hetero_logit_eval
) %>%
bind_cols(
model = c("logit", "hetlogit"), .
) %>%
select(model, bias, coverage) %>%
kable(digits = 3)
model  bias  coverage 

logit  0.037  0.925 
hetlogit  0.005  0.960 
The heteroskedastic logit clearly outperforms standard logit, both in terms of bias and in terms of the coverage of the 95 percent confidence intervals (these should contain, or “cover”, the true parameter value 95 percent of the time).
Conclusion
This discussion has hardly exhausted the many threats to unbiased and efficient MLE estimation of nonlinear models. Other forms of unmeasured heterogeneity and even omitted variables (whether or not they are independent of a response and predictor of interest) can portend problems that generally don’t apply for linear models.
For this reason, care should be taken when using MLE for nonlinear regression estimation. Don’t just use logit for a binary response without interrogating the assumptions that underlie such a modeling choice.
Further Reading
I highly recommend reading Dave Giles’ blog post about this issue.
[1] For a nice summary, see Hayes and Cai (2007).
[2] The standard errors will also be inefficient since the standard errors are calculated directly from the Jacobian of the likelihood function.