You’ve probably hit a point where linear regression feels too simple for your data. Maybe you’re working with count data that can’t be negative, or binary outcomes where predictions need to stay between 0 and 1. This is where Generalized Linear Models come in.
I spent years forcing data into ordinary least squares before realizing GLMs handle these situations naturally. The statsmodels library in Python makes this accessible without needing to switch to R or deal with academic textbooks that assume you already know everything.
What are Generalized Linear Models and when should you use them?
Generalized Linear Models extend regular linear regression to handle more complex scenarios. While standard linear regression assumes your outcome is continuous with constant variance, GLMs relax these assumptions through two key components: a distribution family and a link function.
GLMs support estimation for outcomes from one-parameter exponential families, which include the Gaussian (normal), Binomial, Poisson, and Gamma distributions. The link function connects your linear predictor to the expected value of your outcome variable.
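In symbols, a GLM assumes g(E[y]) = β₀ + β₁x₁ + ... + βₖxₖ, where g is the link function. Ordinary linear regression is just the special case with a Gaussian family and an identity link, so relaxing either piece is what buys you the extra flexibility.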
Think of it this way: you track website visitors (predictor) and whether each one converts (outcome). Linear regression might predict a conversion probability of 1.3, or a negative one, which makes no sense. A binomial GLM with a logit link keeps predictions between 0 and 1, so they can be read as probabilities.
The basic structure looks like this:
import statsmodels.api as sm
import pandas as pd
# Your data setup
X = sm.add_constant(df[['feature1', 'feature2']])
y = df['outcome']
# Create and fit GLM
model = sm.GLM(y, X, family=sm.families.Binomial())
results = model.fit()
print(results.summary())
Note that statsmodels doesn’t automatically include an intercept, so you need to add one using add_constant.
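Here's a quick look at what add_constant actually does, using a throwaway DataFrame:
demo = pd.DataFrame({'feature1': [1.0, 2.0, 3.0]})
print(sm.add_constant(demo))
# The result gains a 'const' column of ones:
#    const  feature1
# 0    1.0       1.0
# 1    1.0       2.0
# 2    1.0       3.0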
How do you choose the right distribution family?
The distribution family depends on your outcome variable’s nature. Here’s the practical breakdown:
For count data (website visits, customer complaints, defects): Use Poisson or Negative Binomial families. Count data consists of non-negative integers representing events that occur at a certain rate.
# Poisson regression for count data
import numpy as np
# Generate sample count data
np.random.seed(42)
visits_df = pd.DataFrame({
    'ad_spend': np.random.uniform(100, 1000, 100),
    'season': np.random.choice(['winter', 'summer'], 100),
    'visits': np.random.poisson(50, 100)
})
# Encode the categorical predictor, add a constant, and fit.
# dtype=float keeps the dummy column numeric; boolean dummies can
# trip up statsmodels' internal dtype conversion
X_visits = pd.get_dummies(visits_df[['ad_spend', 'season']], drop_first=True, dtype=float)
X_visits = sm.add_constant(X_visits)
y_visits = visits_df['visits']
poisson_model = sm.GLM(y_visits, X_visits, family=sm.families.Poisson())
poisson_results = poisson_model.fit()
The Poisson family uses a log link function by default, which ensures predictions stay positive.
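You can check this directly: the fitted means are just the exponentiated linear predictor. A quick sanity check, reusing X_visits and poisson_results from above:
lin_pred = X_visits @ poisson_results.params   # linear predictor, on the log scale
manual_mean = np.exp(lin_pred)                 # undo the log link
print(np.allclose(manual_mean, poisson_results.predict(X_visits)))  # True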
For binary outcomes (yes/no, converted/didn’t convert): Use Binomial family with logit link.
# Logistic regression setup
customers_df = pd.DataFrame({
    'income': np.random.uniform(30000, 150000, 200),
    'age': np.random.randint(18, 70, 200),
    'purchased': np.random.binomial(1, 0.3, 200)
})
X = sm.add_constant(customers_df[['income', 'age']])
y = customers_df['purchased']
logit_model = sm.GLM(y, X, family=sm.families.Binomial())
logit_results = logit_model.fit()
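The same sanity check works for the logit link, whose inverse is the logistic function:
lin_pred = X @ logit_results.params
manual_prob = 1 / (1 + np.exp(-lin_pred))      # inverse logit
print(np.allclose(manual_prob, logit_results.predict(X)))  # True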
For continuous positive data (insurance claims, transaction amounts): Gamma family works well.
# Gamma regression for positive continuous outcomes
salary_df = pd.DataFrame({
    'experience_years': np.random.randint(0, 20, 100),
    'education_level': np.random.choice([1, 2, 3], 100),
    'salary': np.random.gamma(2, 20000, 100)
})
X = sm.add_constant(salary_df[['experience_years', 'education_level']])
y = salary_df['salary']
gamma_model = sm.GLM(y, X, family=sm.families.Gamma())
gamma_results = gamma_model.fit()
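One caveat worth knowing: the Gamma family in statsmodels defaults to the inverse power link, not the log link. If you want log-scale, multiplicative effects like the Poisson example, pass the link explicitly (the capitalized class name is statsmodels 0.14+; older versions spell it links.log):
# Gamma GLM with an explicit log link instead of the default inverse power
gamma_log_model = sm.GLM(y, X, family=sm.families.Gamma(link=sm.families.links.Log()))
gamma_log_results = gamma_log_model.fit()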
What about the formula-based approach?
You can describe models using R-style formulas or arrays. The formula approach is cleaner when you’re dealing with categorical variables.
from statsmodels.formula.api import glm
# Formula-based model specification
formula_model = glm(
    formula='visits ~ ad_spend + C(season)',
    data=visits_df,  # the count data from the Poisson example
    family=sm.families.Poisson()
)
formula_results = formula_model.fit()
The C() wrapper treats a variable as categorical, creating dummy variables automatically without manual encoding. Another convenience: the formula interface adds an intercept for you, so there's no add_constant step here.
How do you interpret the results?
The summary output gives you coefficient estimates, standard errors, z-scores, and p-values. But interpretation depends on your link function.
For Poisson regression with log link:
print(poisson_results.summary())
# Get coefficients
coefficients = poisson_results.params
print(f"Coefficients:\n{coefficients}")
# Exponentiate for multiplicative effects
exp_coef = np.exp(coefficients)
print(f"\nExponentiated coefficients:\n{exp_coef}")
An exponentiated coefficient of 1.05 for ad_spend means a one-unit increase in ad spend multiplies expected visits by 1.05 (a 5% increase).
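The same transformation applies to the confidence intervals, which keeps the uncertainty on the interpretable rate-ratio scale:
# 95% confidence intervals, exponentiated onto the multiplicative scale
exp_ci = np.exp(poisson_results.conf_int())
print(exp_ci)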
For logistic regression:
# Get odds ratios
odds_ratios = np.exp(logit_results.params)
print(f"Odds ratios:\n{odds_ratios}")
An odds ratio of 1.02 for income means each additional dollar of income multiplies the odds of purchase by 1.02. Watch the units here: when a predictor is measured in raw dollars, realistic per-unit odds ratios sit extremely close to 1, so rescaling makes them far easier to read.
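For example, re-expressing income in thousands gives an odds ratio per $1,000 instead of per dollar (a hypothetical tweak reusing customers_df from the logistic example):
customers_df['income_k'] = customers_df['income'] / 1000
X_k = sm.add_constant(customers_df[['income_k', 'age']])
logit_k = sm.GLM(customers_df['purchased'], X_k, family=sm.families.Binomial()).fit()
print(np.exp(logit_k.params))  # odds ratios, now per $1,000 of income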
What should you check for model quality?
The summary method shows goodness-of-fit statistics like deviance and Pearson chi-squared. For Poisson models, a common failure point is overdispersion where variance exceeds the mean.
# Check for overdispersion
print(f"Deviance: {poisson_results.deviance}")
print(f"Pearson chi2: {poisson_results.pearson_chi2}")
print(f"DF Residuals: {poisson_results.df_resid}")
# Dispersion parameter (should be close to 1 for Poisson)
dispersion = poisson_results.pearson_chi2 / poisson_results.df_resid
print(f"Dispersion: {dispersion}")
If dispersion is much larger than 1, consider using Negative Binomial instead:
from statsmodels.discrete.discrete_model import NegativeBinomial
# Refit the count model from the Poisson example; NegativeBinomial
# estimates the dispersion parameter alpha along with the coefficients
nb_model = NegativeBinomial(y_visits, X_visits)
nb_results = nb_model.fit()
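There is also a NegativeBinomial family for the GLM interface; the difference is that the GLM version treats the dispersion parameter alpha as fixed, so you supply a value rather than estimate it:
# GLM variant: you choose alpha up front (default is 1.0) rather than estimating it
nb_glm = sm.GLM(y_visits, X_visits, family=sm.families.NegativeBinomial(alpha=1.0))
nb_glm_results = nb_glm.fit()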
How do you make predictions?
Generate predictions on new data using the predict method:
# Create new data
new_data = pd.DataFrame({
    'ad_spend': [500.0, 750.0, 1000.0],
    'season_winter': [1.0, 0.0, 1.0]  # drop_first kept the 'winter' dummy, not 'summer'
})
new_data = sm.add_constant(new_data)
# Generate predictions
predictions = poisson_results.predict(new_data)
print(f"Predicted visits: {predictions.values}")
# Get confidence intervals
prediction_summary = poisson_results.get_prediction(new_data)
conf_int = prediction_summary.summary_frame()
print(conf_int)
For binary outcomes, predictions represent probabilities:
new_customers = pd.DataFrame({
    'income': [50000, 100000, 150000],
    'age': [25, 40, 55]
})
new_customers = sm.add_constant(new_customers)
purchase_prob = logit_results.predict(new_customers)
print(f"Purchase probabilities: {purchase_prob.values}")
What are common mistakes to avoid?
First, forgetting to add a constant term when using the array-based specification. This silently drops the intercept, forces the fit through the origin, and biases every other coefficient.
Second, choosing the wrong family. Binary data needs Binomial, not Gaussian. Count data needs Poisson or Negative Binomial, not ordinary least squares: OLS assumes a continuous outcome with constant variance, and non-negative integer counts violate both assumptions.
Third, ignoring overdispersion in Poisson models. Always check dispersion and switch to Negative Binomial if needed.
Fourth, misinterpreting coefficients. Remember that coefficients are on the link scale. You need to transform them back for meaningful interpretation.
How does this compare to scikit-learn?
Scikit-learn is designed for machine learning, while statsmodels is made for rigorous statistics. Use scikit-learn when prediction accuracy matters most. Use statsmodels when you need inference, hypothesis testing, and detailed statistical diagnostics.
Statsmodels gives you p-values, confidence intervals, and goodness-of-fit measures. Scikit-learn focuses on cross-validation, regularization, and prediction performance.
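To see the contrast concretely, here's a rough side-by-side on the purchase data from earlier. One gotcha: scikit-learn's LogisticRegression regularizes by default, so you have to switch the penalty off to get coefficients comparable to statsmodels (penalty=None needs scikit-learn 1.2+; older versions use penalty='none'):
from sklearn.linear_model import LogisticRegression
# Disable the default L2 penalty so both libraries fit the same model
sk_model = LogisticRegression(penalty=None, max_iter=1000)
sk_model.fit(customers_df[['income', 'age']], customers_df['purchased'])
print(sk_model.coef_, sk_model.intercept_)  # point estimates only
print(logit_results.summary())              # estimates plus SEs, p-values, CIs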
Both have their place. For exploratory analysis and understanding relationships, statsmodels wins. For production prediction systems, scikit-learn often makes more sense.
Generalized Linear Models solve real problems that basic regression can’t handle. The statsmodels implementation gives you the statistical rigor to understand what’s actually happening in your data, not just make predictions. Start with the distribution that matches your outcome type, check your diagnostics, and interpret results in context.
For more detailed information, check out the official statsmodels GLM documentation and this practical guide to Poisson regression.

