You’ve collected data from the same patients over multiple visits, or tracked students within schools over several years. Your dataset has that nested, clustered structure where observations aren’t truly independent. Standard regression methods assume independence, but you know better. That’s where Generalized Estimating Equations (GEE) come in.

GEE gives you a way to handle correlated data without making strict distributional assumptions. It’s designed for panel, cluster, or repeated measures data where observations may correlate within clusters but remain independent across clusters. Python’s statsmodels library implements GEE with a practical, straightforward API that lets you focus on your analysis rather than wrestling with the math.

What problems does GEE actually solve?

Traditional generalized linear models (GLMs) relate a dependent variable to predictors through a link function, but they assume observations are independent. When you have repeated measurements on the same subjects, measurements from students in the same classroom, or patients treated at the same hospital, that independence assumption breaks down.

GEE estimates population-averaged model parameters while accounting for within-cluster correlation. Instead of trying to model the correlation structure perfectly, GEE uses a “working” correlation structure. The beauty here is robustness: even if you misspecify the correlation, your parameter estimates remain consistent as long as your mean model is correct.
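Written out (this is the standard textbook formulation, not anything specific to statsmodels), GEE solves the estimating equations

\sum_{i=1}^{K} D_i^{\top} V_i^{-1} \left( Y_i - \mu_i(\beta) \right) = 0,
\qquad V_i = \phi \, A_i^{1/2} R(\alpha) A_i^{1/2}

where, for cluster i, mu_i = g^{-1}(X_i beta) is the mean model, D_i is its derivative with respect to beta, A_i contains the variance function, and R(alpha) is the working correlation. Only the mean model has to be right for the estimates of beta to be consistent; R(alpha) mainly affects efficiency.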

Think about a clinical trial tracking patient symptoms weekly over three months. Measurements from the same patient will naturally correlate. Week 1 and Week 2 measurements are more similar than Week 1 and Week 12. GEE handles this without requiring you to specify a complex likelihood function.

How do I fit a basic GEE model?

Let’s start with actual code. Here’s a complete example using epilepsy seizure data:

import statsmodels.api as sm
import statsmodels.formula.api as smf

# Load example dataset
data = sm.datasets.get_rdataset('epil', package='MASS').data

# Specify family and correlation structure
family = sm.families.Poisson()
correlation = sm.cov_struct.Exchangeable()

# Fit the model
model = smf.gee(
    "y ~ age + trt + base",
    "subject",
    data,
    cov_struct=correlation,
    family=family
)

results = model.fit()
print(results.summary())

This model analyzes seizure counts (a count outcome, hence Poisson family) with patient as the grouping variable. The exchangeable correlation structure assumes all measurements within a patient have the same correlation.
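After fitting, the working correlation object is updated in place, so you can ask it what it estimated (summary() here is a method of the statsmodels cov_struct classes):

# Estimated common within-subject correlation under the exchangeable structure
print(correlation.summary())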

Which correlation structure should I choose?

This question trips up everyone starting with GEE. The independence structure assumes no correlation among outcomes from the same subject, while exchangeable assumes the same correlation between any two outcomes from the same subject. An AR-1 (autoregressive) structure assumes correlation decreases as measurements get further apart in time.

Here’s how to think about it practically:

Independence: Use when clusters are small or you want conservative estimates. This essentially gives you GLM with robust standard errors.

correlation = sm.cov_struct.Independence()

Exchangeable: Best for data clustered by subject without a clear time ordering. Students within schools, patients within hospitals, measurements that don’t have a natural sequence.

correlation = sm.cov_struct.Exchangeable()

Autoregressive (AR-1): For time-ordered data where you expect correlation to decay over time. Weekly measurements, daily observations, anything with a temporal structure.

correlation = sm.cov_struct.Autoregressive()
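If your data include an explicit time variable, you can pass it to the model so the AR-1 structure knows how far apart observations are. A minimal sketch reusing the epilepsy example, with its period column standing in as the time index:

# AR-1 working correlation with observation times supplied explicitly
model_ar = smf.gee(
    "y ~ age + trt + base",
    "subject",
    data,
    time=data["period"].values,  # within-subject visit number from the epil data
    cov_struct=sm.cov_struct.Autoregressive(),
    family=sm.families.Poisson()
)
results_ar = model_ar.fit()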

The good news is that GEE estimates remain valid even if you misspecify the correlation structure, as long as the mean model is correctly specified. Start with exchangeable, then check how coefficients and standard errors change with other structures. If the changes are minimal, stick with the simpler structure.
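One way to run that comparison, sketched with the epilepsy model from earlier:

# Refit the same mean model under several working correlation structures
for cs in [sm.cov_struct.Independence(),
           sm.cov_struct.Exchangeable(),
           sm.cov_struct.Autoregressive()]:
    res = smf.gee("y ~ age + trt + base", "subject", data,
                  time=data["period"].values,
                  cov_struct=cs, family=sm.families.Poisson()).fit()
    print(type(cs).__name__)
    print(res.params.round(3))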

What families and link functions work with GEE?

GEE supports the same one-parameter exponential families as GLMs. This means you can use:

Gaussian for continuous outcomes:

family = sm.families.Gaussian()

Binomial for binary or proportion data:

family = sm.families.Binomial()

Poisson for count data:

family = sm.families.Poisson()

Gamma for positive continuous data:

family = sm.families.Gamma()

Each family supports multiple link functions. Binomial works with logit, probit, and several others. Poisson typically uses log link. The statsmodels documentation lists which combinations are valid, but the defaults usually make sense.
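If you want something other than the default link, pass it when constructing the family. For example, a probit link for a binary outcome (Probit is one of the link classes in sm.families.links; older statsmodels releases spell these class names in lowercase):

# Binomial family with a probit link instead of the default logit
family = sm.families.Binomial(link=sm.families.links.Probit())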

How do I interpret GEE results?

Let’s look at a complete example with binary outcomes:

import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Create sample longitudinal data
np.random.seed(42)
n_subjects = 100
n_times = 4

data = pd.DataFrame({
    'subject': np.repeat(range(n_subjects), n_times),
    'time': np.tile(range(n_times), n_subjects),
    'treatment': np.repeat(np.random.binomial(1, 0.5, n_subjects), n_times),
    'age': np.repeat(np.random.normal(50, 10, n_subjects), n_times)
})

# Simulate outcome (the subject-level covariates induce some within-subject correlation)
data['outcome'] = np.random.binomial(
    1,
    1 / (1 + np.exp(-(
        -2 + 
        0.5 * data['treatment'] + 
        0.02 * data['age'] + 
        0.1 * data['time']
    )))
)

# Fit model
family = sm.families.Binomial()
correlation = sm.cov_struct.Exchangeable()

model = smf.gee(
    "outcome ~ treatment + age + time",
    "subject",
    data,
    cov_struct=correlation,
    family=family
)

results = model.fit()
print(results.summary())

The output shows coefficient estimates, standard errors, z-statistics, and p-values. With a logit link (the default for Binomial), coefficients are log odds ratios. A coefficient of 0.5 for treatment means treatment multiplies the odds of the outcome by roughly exp(0.5) ≈ 1.65, holding the other predictors fixed.
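To report results on the odds-ratio scale, exponentiate the coefficients and their confidence limits (conf_int() returns the 95% interval by default):

# Odds ratios with 95% confidence intervals
print(np.exp(results.params))
print(np.exp(results.conf_int()))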

The summary also reports:

  • Number of observations and clusters
  • Cluster sizes (min, max, mean)
  • The correlation structure used
  • Covariance type (robust by default)

What’s the difference between naive and robust standard errors?

The “robust” covariance type is the standard sandwich estimator and is the default in statsmodels (and in most other GEE implementations). It protects you against misspecification of the working correlation structure. The “naive” estimator gives smaller standard errors but is only correct if the working correlation structure is correctly specified.

# Robust standard errors (default)
results_robust = model.fit(cov_type='robust')

# Naive standard errors
results_naive = model.fit(cov_type='naive')
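A quick check on how much the choice matters is to compare the two sets of standard errors; ratios far from 1 suggest the working correlation is a poor fit:

# Ratio of robust to naive standard errors for each coefficient
print(results_robust.bse / results_naive.bse)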

Always use robust standard errors unless you’re absolutely certain about your correlation structure. The computational cost is negligible, and the protection against model misspecification is worth it.

How does GEE compare to mixed models?

GEE gives you population-averaged (marginal) effects. Mixed models give you subject-specific (conditional) effects. This matters for interpretation.

With GEE analyzing treatment effects in a logistic regression, you’re estimating the average treatment effect across the population. With a mixed model, you’re estimating the treatment effect for an individual with a specific random effect value.

For linear models, marginal and conditional mean structures are the same, though parameter estimates from GEE and mixed modeling won’t match exactly. For nonlinear models like logistic regression, marginal effects are typically smaller than conditional effects.
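To see the linear case concretely, here is a small self-contained sketch (simulated random-intercept data, not from the examples above) that fits the same mean model both ways; the fixed-effect estimates come out close, as expected:

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Simulate clustered continuous data with a subject-level random intercept
rng = np.random.default_rng(0)
n_subj, n_obs = 200, 4
sim = pd.DataFrame({
    'subject': np.repeat(range(n_subj), n_obs),
    'x': rng.normal(size=n_subj * n_obs),
})
intercepts = np.repeat(rng.normal(scale=1.0, size=n_subj), n_obs)
sim['y'] = 1.0 + 0.5 * sim['x'] + intercepts + rng.normal(size=n_subj * n_obs)

# Population-averaged fit (GEE) vs subject-specific fit (linear mixed model)
gee_res = smf.gee("y ~ x", "subject", sim,
                  cov_struct=sm.cov_struct.Exchangeable(),
                  family=sm.families.Gaussian()).fit()
lmm_res = smf.mixedlm("y ~ x", sim, groups=sim['subject']).fit()

print(gee_res.params)     # marginal (population-averaged) coefficients
print(lmm_res.fe_params)  # conditional (fixed-effect) coefficients

With a logistic outcome the two would diverge, with the GEE coefficients attenuated toward zero relative to the conditional ones.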

Choose GEE when:

  • You care about population averages
  • You want robust inference without distributional assumptions
  • Your clusters are large and numerous
  • You don’t need subject-specific predictions

Choose mixed models when:

  • You need subject-specific predictions
  • You want to quantify between-subject variability
  • You have nested random effects
  • You’re comfortable with distributional assumptions

Can I handle missing data with GEE?

GEE handles missing data through available case analysis. If a subject has some missing time points, GEE uses the observed data and accounts for the missing pattern in the correlation estimation.

model = smf.gee(
    "outcome ~ treatment + age",
    "subject",
    data,
    cov_struct=correlation,
    family=family,
    missing='drop'  # Drop observations with missing values
)

The missing parameter accepts ‘none’ (no checking), ‘drop’ (remove missing), or ‘raise’ (error on missing). The correlation parameters themselves are estimated from the available nonmissing pairs using the usual moment estimators.

This approach assumes data are missing completely at random (MCAR). If missingness relates to unobserved outcomes (not at random), you need more sophisticated approaches like inverse probability weighting or multiple imputation.

What about convergence issues?

GEE uses an iterative algorithm that alternates between estimating mean parameters and updating the correlation structure. Sometimes this process doesn’t converge, especially with small samples or complex correlation structures.

When you hit convergence problems:

  1. Simplify your correlation structure. Try independence or exchangeable instead of unstructured.
  2. Check your data. Extremely sparse clusters or very unbalanced designs cause problems.
  3. Adjust the maximum iterations:
results = model.fit(maxiter=100)

  4. Try different starting values or scaling options; see the sketch after this list.
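For starting values, GEE.fit accepts a start_params argument; one common trick is to seed it with estimates from a plain GLM fit. A sketch assuming the epilepsy Poisson model and data from the first example:

# Use GLM estimates as starting values for the GEE iterations
glm_start = sm.GLM.from_formula("y ~ age + trt + base", data,
                                family=sm.families.Poisson()).fit().params
results = model.fit(maxiter=200, start_params=glm_start)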

Unlike maximum likelihood estimation, GEE iterations can stop at any point and still produce meaningful estimates. The algorithm doesn’t need to fully converge to give you usable results, though you should be more cautious about inference.

GEE gives you a practical middle ground for correlated data. It’s less complex than full mixed models but more appropriate than ignoring the correlation structure entirely. The statsmodels implementation makes it accessible, with sensible defaults and a familiar formula interface. Start simple with your correlation structure, use robust standard errors, and focus on getting the mean model right. The rest usually takes care of itself.

For more technical details, check the statsmodels GEE documentation and the foundational paper by Liang and Zeger (1986), “Longitudinal Data Analysis Using Generalized Linear Models.”
