Back to Blog

I know of few political scientists, economists, or other quantitative researchers who don’t use some form of robust standard errors when estimating linear regression models via OLS.


Because only under the most ideal and stringent of circumstances will classic OLS standard errors be efficient.[1]

Efficiency requires that the variance of the regression error is constant and that individual observations are independent of each other. In practice this usually never is the case. Instead, the errors may be heteroskedastic (having non-constant variance) or may not all be independent (they may be clustered).

Let’s zero in on the problem of non-constant variance.

When it comes to linear models, the solution to this problem is simple: just use a robust estimator for the standard errors (there are many to choose from). Such estimators take into account heterogeneity in the errors and thus provide more consistent standard errors and more appropriate statistical inferences.

But, for nonlinear models such as logit, the problem is a little more insidious, and, therefore, it requires doing more than simply using robust standard errors.

The Problem

To illustrate why the problem of heteroskedasticity has different implications between linear and nonlinear models, let’s take a quick look at a pair of regressions:

  1. Yi = β0 + β1Xi + ϵi, ϵi ∼ 𝒩(0,σi2);

  2. Pr (Bi = 1) = L[(β0+β1Xi)/σi].

Equation 1 is a linear model and equation 2 is a logit model, where the function L( ⋅ ) is just the logistic function:

L(x) = 1/[1 + exp ( − x)].

A logit model can be equivalently expressed as

log [odds(Bi = 1)] = (β0 + β1Xi)/σi

These models differ in several ways. The first and most obvious difference is that equation 1 specifies a continuous response Y as a linear function of X. Meanwhile, equation 2 specifies the probability of a binary response B taking the value 1 as a logistic function of X.

But the differences go beyond these models’ functional forms. The difference that’s most material for our concern with heteroskedasticity is how the variance parameter σi2 enters each equation. In the linear model, the variance enters the model through an additive error term ϵ. But, in the logit model, the variance enters the model directly as the denominator of the linear additive input to the logistic function.

The reason for this difference lies in where the stochastic component of each model originates. In a linear model, the stochastic element enters additively, while in the logit model the logitistic curve itself is an input to a stochastic Bernoulli function.

The practical implication of this difference is that the variance component of the model isn’t part of the identity of the β parameters in the linear model but is part of the identity of the βs in the logit model.

This is a problem for classic logit estimation if the observation-specific variance is non-constant.

The reason for this is that classic logit relies on a maximum likelihood estimator (MLE) to identify the β parameters where the σ is assumed constant at σi = σ = 1 ∀ i.

This assumption is only appropriate if each σi = σ. When this isn’t the case, the MLE estimates of the β’s won’t be correctly identified.[2]

The Common but Not-quite-right Solution

To deal with heteroskedasticity in nonlinear models like logit, it’s common (at least in my field of political science) to see researchers use a robust variance-covariance estimator for standard errors but do nothing to correct the parameter estimates themselves.

This choice is understandable. Most statistical software packages will allow users to get something like the robust variance-covariance matrix for OLS for their MLE estimates. And, most software packages will also compute and report the related test statistics and p-values.

However, just because our statistical software lets us do something, that doesn’t mean we should.

The supposedly robust standard errors that statistical packages produce are problematic for a number of reasons, the first being that they reflect the variance of a parameter that has not itself be estimated well. Even more, these robust standard errors are computed using the gradients of the very likelihood function that fails to capture the non-constant variance in the data.

How is it possible to generate correct standard errors from the gradients of a mis-specified likelihood function?

The answer: it isn’t.

The Better Solution (but not a silver bullet)

To deal with heteroskedasticity in nonlinear models like logit there exist a class of heteroskedastic estimators that allow for explicitly modeling the variance component of the model.

Such estimators are not a perfect hedge against bias to be sure. However, in certain instances it is possible to recover both less biased and more efficient parameter estimates by taking non-constant variance into account.

Consider, for instance, a scenario where the probability of a binary outcome is given as

Pr (Bi = 1) = L[(Xi)/exp(Xi)].

It specifies that the probability that the binary response Bi = 1 is not only a function of Xi, but also that the variance changes as a function of Xi—exp (Xi) essentially replaces σi in the specification.

To recover estimates of the relationship between Xi and the response, we would specify a heteroskedastic logit model as follows:

log [odds(Bi = 1)] = (β0 + β1Xi)/exp (γ0 + γ1Xi).

Compare this to the classic logit specification, which just assumes σi = 1 ∀ i:

log [odds(Bi = 1)] = β0 + β1Xi.

If we were to evaluate the performance of the MLE estimates for these logit models, we would find that the heteroskedastic logit is less biased and that its standard errors provide the appropriate coverage for confidence intervals, supporting more informative statistical inferences.

A Simulation

To quickly show the advantage conferred by heteroskedastic logit, we can do a quick simulation study in R.

We first need to attach some statistical packages that we’ll need to run the analysis.

# Install if not already #
# devtools::install_github("milesdwilliams15/seerrr")

# Packages Needed  # Reason                    #
# ================ # ========================= #
library(seerrr)    # for simulation            #
library(tidyverse) # for grammar               #
library(glmx)      # for heteroskedastic logit #
library(kableExtra)# for making nice tables    #

Next, we’ll start by iteratively simulating a data-generating process.

# Simulate a d.g.p. 1,000 times:
set.seed(101010101) # setting seed for replicability
L <- function(x) 1 / (1 + exp(-x)) # logistic fun.
  N = 1000,
  id_var = 1:N,
  X = rnorm(N),
  s = exp(X),
  Y = rbinom(N, 1, L(X / s))
) -> sim_data

The above randomly generates a list of datasets of size N = 1, 000 drawn from a d.g.p. that includes (1) a predictor variable X that is a normal random variable with mean 0 and standard deviation of 1 and (2) a binary response variable Y that takes the value 1 with probability L(X/exp (X).

With multiple samples drawn from the d.g.p., we can now apply our choice of estimators to model Y as a function X. Below, I specify a classic logit model and a heteroskedastic logit model. (See documentation for hetglm in the glmx package for more on syntax and usage.)

# classic logit
logit <- function(...) glm(..., family = binomial)
  Y ~ X,
  vars = "X",
  estimator = logit,
) -> classic_logit

# heteroskedastic logit
het_logit <- function(...) hetglm(..., family = binomial) %>%
  Y ~ X | X,
  vars = "X",
  estimator = het_logit
) -> hetero_logit

The above generates estimates that we get using logit and heteroskedastic logit. We can evaluate these estimates as follows:

# evaluate classic logit
  classic_logit, what = "bias", truth = 1
) -> classic_logit_eval

# evaluate heteroskedastic logit
  hetero_logit, what = "bias", truth = 1
) -> hetero_logit_eval

And we can report the results by writing:

# comparison
) %>%
    model = c("logit", "het-logit"), .
  ) %>%
  select(model, bias, coverage) %>%
  kable(digits = 3)
model bias coverage
logit -0.037 0.925
het-logit 0.005 0.960

The heteroskedastic logit clearly outperforms standard logit, both in terms of bias and in terms of the coverage of the 95 percent confidence intervals (these should contain, or “cover”, the true parameter value 95 percent of the time).


This discussion has hardly exhausted the many threats to unbiased and efficient MLE estimation of nonlinear models. Other forms of unmeasured heterogeneity and even omitted variables (whether or not they are independent of a response and predictor of interest) can portend problems that generally don’t apply for linear models.

For this reason, care should be taken when using MLE for nonlinear regression estimation. Don’t just use logit for a binary response without interrogating the assumptions that underlie such a modeling choice.

Further Reading

I highly recommend reading Dave Giles’ blog post about this issue.

Back to Blog

[1] For a nice summary, see Hayes and Cai (2007).

[2] The standard errors will also be inefficient since the standard errors are calculated directly from the Jacobian of the likelihood function.