Statsmodels organizes its functionality into topic-based subpackages rather than dumping everything into a single namespace. Understanding this structure helps you find the right models quickly and import them efficiently.

The library provides two primary access points: statsmodels.api for general use and statsmodels.formula.api for R-style formula syntax. Beyond these, specialized subpackages contain models, tools, and functions organized by statistical domain.

How the API structure works

statsmodels.api is a convenience namespace rather than the whole library. It collects the most commonly used classes and functions from the subpackages and re-exports them through one clean interface. Importing it still loads the subpackages those names come from, so it is convenient rather than lightweight.

Standard import convention:

import statsmodels.api as sm
import statsmodels.formula.api as smf

These imports give you access to regression models, GLMs, time series tools, and statistical tests without navigating the full directory structure. The API makes the most useful items available within one or two attribute levels.

What the API includes:

  • Core regression models (OLS, WLS, GLS, GLSAR)
  • Generalized linear models and families
  • Discrete and count models (Logit, Probit, Poisson)
  • Robust regression (RLM)
  • Mixed effects models (MixedLM)
  • Time series analysis classes
  • Statistical tests and diagnostic tools
  • Datasets for examples and testing

The API doesn’t include every function in the library. Specialized features often require direct imports from their subpackages.

Core subpackages and their purposes

Statsmodels groups related functionality into subpackages that match statistical modeling categories. Each subpackage focuses on a specific type of analysis.

regression

Contains linear regression models and related tools.

Main classes:

  • OLS: Ordinary Least Squares regression
  • WLS: Weighted Least Squares
  • GLS: Generalized Least Squares
  • GLSAR: GLS with autoregressive errors
  • QuantReg: Quantile regression
  • RecursiveLS: Recursive least squares
  • MixedLM: Mixed effects linear models

Common usage:

from statsmodels.regression.linear_model import OLS
# or through the API
import statsmodels.api as sm
model = sm.OLS(y, X)

The regression subpackage provides the foundation for most statistical modeling work. These models handle continuous response variables with various assumptions about error structure.

genmod (Generalized Models)

Handles generalized linear models, where the response can follow exponential-family distributions such as the binomial, Poisson, or gamma rather than only the normal.

Main classes:

  • GLM: Generalized Linear Model
  • GEE: Generalized Estimating Equations
  • families: Distribution families (Gaussian, Binomial, Poisson, etc.)
  • cov_struct: Covariance structures for GEE

Distribution families available:

  • Binomial for binary and proportion data
  • Poisson for count data
  • Gamma for positive continuous data
  • Inverse Gaussian for skewed positive data
  • Negative Binomial for overdispersed counts

Common usage:

import statsmodels.api as sm
model = sm.GLM(y, X, family=sm.families.Binomial())

GLMs extend linear regression to handle different response distributions while maintaining the linear predictor framework.

discrete

Focuses on models where the dependent variable is categorical or count-based.

Main classes:

  • Logit: Binary logistic regression
  • Probit: Binary probit regression
  • MNLogit: Multinomial logit for unordered categories
  • Poisson: Poisson regression for count data
  • NegativeBinomial: For overdispersed count data
  • ZeroInflatedPoisson: Handles excess zeros in count data
  • ZeroInflatedNegativeBinomial: Combines zero inflation with overdispersion

Common usage:

from statsmodels.discrete.discrete_model import Logit
# or
import statsmodels.api as sm
model = sm.Logit(y, X)

Discrete models handle situations where your outcome isn’t continuous. Use these for classification, count analysis, or choice modeling.

tsa (Time Series Analysis)

Comprehensive time series modeling toolkit with univariate and multivariate methods.

Main classes and modules:

  • AutoReg: Autoregressive models (replaces the older AR class)
  • ARIMA: Autoregressive Integrated Moving Average
  • SARIMAX: Seasonal ARIMA with exogenous variables
  • VAR: Vector Autoregression for multiple time series
  • stattools: ACF, PACF, stationarity tests
  • filters: Time series filtering methods
  • seasonal: Seasonal decomposition (STL)

Common usage:

from statsmodels.tsa.arima.model import ARIMA
# or
import statsmodels.tsa.api as tsa

Time series analysis in Statsmodels covers forecasting, trend analysis, seasonal patterns, and multivariate dynamics. The stattools module provides diagnostic functions for checking stationarity and identifying appropriate model orders.

robust

Implements regression methods that handle outliers and violations of normality assumptions.

Main class:

  • RLM: Robust Linear Model using M-estimators

Common usage:

from statsmodels.robust.robust_linear_model import RLM
# or
import statsmodels.api as sm
model = sm.RLM(y, X)

Robust regression automatically downweights observations with large residuals, preventing outliers from dominating your parameter estimates.

stats

Statistical tests, diagnostics, and hypothesis testing tools.

Submodules:

  • diagnostic: Heteroscedasticity tests, autocorrelation tests, normality tests
  • outliers_influence: Leverage, Cook’s distance, influence measures
  • multitest: Multiple testing corrections
  • proportion: Tests for proportions
  • weightstats: Weighted statistics and tests
  • anova: ANOVA tables and methods

Common usage:

from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.outliers_influence import variance_inflation_factor

The stats subpackage provides the diagnostic tools you need to validate model assumptions and detect problems in your analysis.

nonparametric

Non-parametric methods that don’t assume specific functional forms or distributions.

Functionality:

  • Kernel density estimation
  • Kernel regression (local polynomial fitting)
  • LOWESS smoothing
  • Non-parametric hypothesis tests

Common usage:

from statsmodels.nonparametric.kde import KDEUnivariate
from statsmodels.nonparametric.kernel_regression import KernelReg

Use these when your relationships aren’t linear or you want to explore data structure without imposing parametric assumptions.
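A minimal kernel density sketch follows, using a simulated bimodal sample. The sample and the manual area check are illustrative assumptions.

```python
import numpy as np
from statsmodels.nonparametric.kde import KDEUnivariate

rng = np.random.default_rng(6)

# Bimodal sample (illustrative): two normal clusters centred at -2 and +2
sample = np.concatenate([rng.normal(-2.0, 0.5, 300), rng.normal(2.0, 0.5, 300)])

kde = KDEUnivariate(sample)
kde.fit()  # default settings: Gaussian kernel with a rule-of-thumb bandwidth

# After fitting, the estimate is available on a grid
grid = kde.support
density = kde.density

# Rough numerical check that the estimated density integrates to about 1
area = np.sum(density[:-1] * np.diff(grid))

density_near_mode = density[np.argmin(np.abs(grid - 2.0))]
density_near_zero = density[np.argmin(np.abs(grid))]
print("area:", area, "mode:", density_near_mode, "valley:", density_near_zero)
```

No distribution family is specified anywhere, yet the estimate recovers both modes. That is the practical meaning of not imposing parametric assumptions.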

duration

Survival analysis and duration models.

Classes for:

  • Proportional hazards models
  • Survival functions
  • Hazard estimation

Useful when analyzing time-to-event data like customer churn, equipment failure, or time until a specific outcome.
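The sketch below shows a survival function estimate and a proportional hazards fit on simulated, right-censored data. The treatment indicator, the log hazard ratio of 0.7, and the censoring time are all assumptions made for the example.

```python
import numpy as np
from statsmodels.duration.hazard_regression import PHReg
from statsmodels.duration.survfunc import SurvfuncRight

rng = np.random.default_rng(7)
n = 400

# Simulated durations (illustrative): the treated group has a higher hazard
treated = rng.integers(0, 2, size=n)
true_log_hazard_ratio = 0.7
event_time = rng.exponential(scale=1.0 / np.exp(true_log_hazard_ratio * treated))

# Right-censor everything still running at time 2.0
censor_time = 2.0
time = np.minimum(event_time, censor_time)
status = (event_time <= censor_time).astype(int)  # 1 = event observed, 0 = censored

# Kaplan-Meier style survival function estimate
km = SurvfuncRight(time, status)

# Cox proportional hazards regression with one covariate
exog = treated[:, None].astype(float)
ph_res = PHReg(time, exog, status=status).fit()

print("estimated log hazard ratio:", ph_res.params[0])
```

The status array is what separates duration models from ordinary regression: it tells the estimator which durations were fully observed and which were cut off.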

multivariate

Multivariate statistical methods for analyzing multiple dependent variables simultaneously.

Models:

  • Factor analysis
  • MANOVA (Multivariate ANOVA)
  • Canonical correlation

Supporting modules

Several modules provide infrastructure and utilities that support model fitting and analysis.

tools

Helper functions for data manipulation, matrix operations, and model utilities.

Common functions:

  • add_constant: Adds intercept column to design matrix
  • categorical: Creates dummy variables from categorical data (deprecated in recent releases; pandas.get_dummies is the usual replacement)
  • Data validation and transformation utilities

Common usage:

import statsmodels.api as sm
X_with_const = sm.add_constant(X)

You’ll use add_constant frequently with the array API: models such as OLS do not add an intercept on their own, whereas the formula API includes one by default.

datasets

Built-in datasets for learning, examples, and testing.

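import statsmodels.api as sm
# Downloads from the Rdatasets collection, so this call needs network access
data = sm.datasets.get_rdataset('mtcars', 'datasets')
# Bundled with statsmodels, so this one works offline
spector = sm.datasets.spector.load_pandas()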

The datasets module includes classic econometric and statistical datasets. Each dataset comes with documentation describing its source and variables.

graphics

Plotting functions for statistical graphics and model diagnostics.

Available plots:

  • Regression plots with confidence bands
  • Partial regression plots
  • Influence plots
  • CCPR plots (component plus residual)
  • Time series plots

Common usage:

from statsmodels.graphics.regressionplots import plot_leverage_resid2
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

These plotting functions integrate with Matplotlib to create publication-quality statistical graphics.

iolib

Input/output tools for formatting results and exchanging data.

  • Table formatting for results
  • Summary table creation
  • Reading Stata .dta files (older releases only; current versions leave this to pandas.read_stata)

Useful when you need to export formatted tables or are working with data from statistical software like Stata.

Formula API vs array API

Statsmodels provides two ways to specify models, each suited to different workflows.

Array API

Uses NumPy arrays or Pandas DataFrames directly. You construct your design matrix manually.

import statsmodels.api as sm
import numpy as np

X = np.random.rand(100, 3)
y = np.random.rand(100)
X = sm.add_constant(X)  # Add intercept

model = sm.OLS(y, X).fit()

The array API gives you complete control over the design matrix. Use this when you need custom transformations or when you’re building programmatic workflows.

Formula API

Uses R-style formulas to specify models. The library handles dummy variable creation, interactions, and transformations automatically.

import statsmodels.formula.api as smf
import numpy as np
import pandas as pd

# Data must be in a DataFrame
data = pd.DataFrame({
    'y': np.random.rand(100),
    'x1': np.random.rand(100),
    'x2': np.random.rand(100),
    'category': np.random.choice(['A', 'B', 'C'], 100)
})

model = smf.ols('y ~ x1 + x2 + C(category)', data=data).fit()

The formula interface is cleaner for exploratory analysis and when working with DataFrames. Categorical variables get automatic dummy coding. Transformations like np.log() work directly in formulas.

Import strategies

Different import patterns work better for different situations.

For interactive analysis:

import statsmodels.api as sm
import statsmodels.formula.api as smf

This gives you access to commonly used models with minimal typing.

For production code:

from statsmodels.regression.linear_model import OLS
from statsmodels.stats.diagnostic import het_breuschpagan

Direct imports make dependencies explicit. They can also reduce import time, although any statsmodels module still pulls in the shared code it depends on.

For subpackage-specific work:

import statsmodels.tsa.api as tsa
model = tsa.SARIMAX(data, order=(1,1,1))

Time series, statistics, and graphics subpackages have their own API modules for focused work.

The sandbox

Statsmodels includes a sandbox directory containing experimental code that isn’t considered production-ready. Features in the sandbox might have incomplete testing, limited documentation, or unstable APIs.

Sandbox contents:

  • Experimental regression methods
  • Developmental time series models
  • Prototype statistical tests
  • Research-stage implementations

Avoid using sandbox code in production unless you’re willing to maintain it yourself if the API changes. Some features migrate from the sandbox to the main subpackages once they’re stable.

Navigating the documentation

The official Statsmodels documentation organizes material differently than the code structure.

Documentation structure:

  • User Guide: Topic-based tutorials and explanations
  • API Reference: Complete class and function listings
  • Examples: Jupyter notebooks demonstrating workflows

Finding what you need:

If you know the model type (linear regression, logistic regression, ARIMA), check the User Guide for the relevant section. It explains when to use each model and shows worked examples.

For specific function signatures and parameters, use the API Reference. It lists every public class and function with complete parameter documentation.

Practical examples of structure usage

Linear regression with diagnostics:

import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_white

# Fit model
model = sm.OLS(y, X).fit()

# Run diagnostics from stats subpackage
white_test = het_white(model.resid, model.model.exog)
print(f"White test p-value: {white_test[1]}")

Time series with custom imports:

from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import adfuller

# Check stationarity
adf_result = adfuller(timeseries)

# Fit ARIMA
model = ARIMA(timeseries, order=(1,1,1)).fit()

Formula-based GLM with custom family:

import statsmodels.formula.api as smf
import statsmodels.api as sm

model = smf.glm(
    'count ~ treatment + age',
    data=df,
    family=sm.families.NegativeBinomial()
).fit()

Benefits of the module organization

The topic-based structure prevents namespace pollution and keeps imports explicit. With direct imports, your code names exactly the pieces it depends on.

Subpackages make the codebase maintainable. Developers working on time series don’t affect regression code. New models get added to appropriate subpackages without disrupting existing functionality.

For users, the organization mirrors how statisticians think about methods. If you need discrete choice modeling, everything relevant lives in the discrete subpackage. Time series tools cluster together in tsa.

The dual API design balances convenience and explicitness. Import statsmodels.api for quick interactive work. Use direct imports for production code where you want crystal-clear dependencies.

This structure reflects a mature library designed for serious statistical work rather than a monolithic catch-all package.
