Statsmodels Library Structure and Subpackages

Statsmodels organizes its functionality into topic-based subpackages rather than dumping everything into a single namespace. Understanding this structure helps you find the right models quickly and import them efficiently.

The library provides two primary access points: statsmodels.api for general use and statsmodels.formula.api for R-style formula syntax. Beyond these, specialized subpackages contain models, tools, and functions organized by statistical domain.


How the API structure works

Importing statsmodels.api does not expose every function in the library. The API module collects the most commonly used classes and functions from the various subpackages and presents them through a single, flat namespace.

Standard import convention:

import statsmodels.api as sm
import statsmodels.formula.api as smf

These imports give you access to regression models, GLMs, time series tools, and statistical tests without navigating the full directory structure. The API makes the most useful items available within one or two attribute levels.

What the API includes:

Core regression models (OLS, WLS, GLS, GLSAR)
Generalized linear models and families
Discrete and count models (Logit, Probit, Poisson)
Robust regression (RLM)
Mixed effects models (MixedLM)
Time series analysis classes
Statistical tests and diagnostic tools
Datasets for examples and testing

The API doesn’t include every function in the library. Specialized features often require direct imports from their subpackages.

Core subpackages and their purposes

Statsmodels groups related functionality into subpackages that match statistical modeling categories. Each subpackage focuses on a specific type of analysis.

regression

Contains linear regression models and related tools.

Main classes:

OLS: Ordinary Least Squares regression
WLS: Weighted Least Squares
GLS: Generalized Least Squares
GLSAR: GLS with autoregressive errors
QuantReg: Quantile regression
RecursiveLS: Recursive least squares
MixedLM: Mixed effects linear models

Common usage:

from statsmodels.regression.linear_model import OLS
# or through the API
import statsmodels.api as sm
model = sm.OLS(y, X)

The regression subpackage provides the foundation for most statistical modeling work. These models handle continuous response variables with various assumptions about error structure.

genmod (Generalized Models)

Handles generalized linear models where the response variable follows distributions other than normal.

Main classes:

GLM: Generalized Linear Model
GEE: Generalized Estimating Equations
families: Distribution families (Gaussian, Binomial, Poisson, etc.)
cov_struct: Covariance structures for GEE

Distribution families available:

Binomial for binary and proportion data
Poisson for count data
Gamma for positive continuous data
Inverse Gaussian for skewed positive data
Negative Binomial for overdispersed counts

import statsmodels.api as sm
model = sm.GLM(y, X, family=sm.families.Binomial())

GLMs extend linear regression to handle different response distributions while maintaining the linear predictor framework.

discrete

Focuses on models where the dependent variable is categorical or count-based.

Main classes:

Logit: Binary logistic regression
Probit: Binary probit regression
MNLogit: Multinomial logit for unordered categories
Poisson: Poisson regression for count data
NegativeBinomial: For overdispersed count data
ZeroInflatedPoisson: Handles excess zeros in count data
ZeroInflatedNegativeBinomial: Combines zero inflation with overdispersion

from statsmodels.discrete.discrete_model import Logit
# or
import statsmodels.api as sm
model = sm.Logit(y, X)

Discrete models handle situations where your outcome isn’t continuous. Use these for classification, count analysis, or choice modeling.

tsa (Time Series Analysis)

Comprehensive time series modeling toolkit with univariate and multivariate methods.

Main classes and modules:

AutoReg: Autoregressive models (replaces the older AR class)
ARIMA: Autoregressive Integrated Moving Average
SARIMAX: Seasonal ARIMA with exogenous variables
VAR: Vector Autoregression for multiple time series
stattools: ACF, PACF, stationarity tests
filters: Time series filtering methods
seasonal: Seasonal decomposition (STL)

from statsmodels.tsa.arima.model import ARIMA
# or
import statsmodels.tsa.api as tsa

Time series analysis in Statsmodels covers forecasting, trend analysis, seasonal patterns, and multivariate dynamics. The stattools module provides diagnostic functions for checking stationarity and identifying appropriate model orders.

robust

Implements regression methods that handle outliers and violations of normality assumptions.

Main class:

RLM: Robust Linear Model using M-estimators

from statsmodels.robust.robust_linear_model import RLM
# or
import statsmodels.api as sm
model = sm.RLM(y, X)

Robust regression automatically downweights observations with large residuals, preventing outliers from dominating your parameter estimates.

stats

Statistical tests, diagnostics, and hypothesis testing tools.

Submodules:

diagnostic: Heteroscedasticity tests, autocorrelation tests, normality tests
outliers_influence: Leverage, Cook’s distance, influence measures
multitest: Multiple testing corrections
proportion: Tests for proportions
weightstats: Weighted statistics and tests
anova: ANOVA tables and methods

from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.outliers_influence import variance_inflation_factor

The stats subpackage provides the diagnostic tools you need to validate model assumptions and detect problems in your analysis.

nonparametric

Non-parametric methods that don’t assume specific functional forms or distributions.

Functionality:

Kernel density estimation
Kernel regression (local polynomial fitting)
LOWESS (locally weighted scatterplot smoothing)
Non-parametric hypothesis tests (several of these live in the stats subpackage)

from statsmodels.nonparametric.kde import KDEUnivariate
from statsmodels.nonparametric.kernel_regression import KernelReg

Use these when your relationships aren’t linear or you want to explore data structure without imposing parametric assumptions.
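As a small illustration, the sketch below fits a univariate kernel density estimate to a simulated normal sample and checks that the estimated density integrates to roughly one over its evaluation grid.

```python
import numpy as np
from statsmodels.nonparametric.kde import KDEUnivariate

rng = np.random.default_rng(6)
sample = rng.normal(size=500)

kde = KDEUnivariate(sample)
kde.fit()  # Gaussian kernel with automatic bandwidth by default

# support is the evaluation grid, density the estimated values on it
widths = np.diff(kde.support)
heights = (kde.density[1:] + kde.density[:-1]) / 2.0
area = float(np.sum(widths * heights))  # trapezoid rule

print(kde.bw, area)
```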

duration

Survival analysis and duration models.

Classes for:

Proportional hazards models
Survival functions
Hazard estimation

Useful when analyzing time-to-event data like customer churn, equipment failure, or time until a specific outcome.
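A hedged sketch using two classes from this subpackage, SurvfuncRight for a Kaplan-Meier style survival estimate and PHReg for a proportional hazards fit. The event and censoring times are simulated, and the hazard coefficient 0.7 is an assumption for the example.

```python
import numpy as np
from statsmodels.duration.hazard_regression import PHReg
from statsmodels.duration.survfunc import SurvfuncRight

rng = np.random.default_rng(7)
n = 300
x = rng.normal(size=n)

# Exponential event times whose hazard rises with x, plus random censoring
event_time = rng.exponential(scale=1.0 / np.exp(0.7 * x))
censor_time = rng.exponential(scale=2.0, size=n)
time = np.minimum(event_time, censor_time)
status = (event_time <= censor_time).astype(int)  # 1 = event observed

km = SurvfuncRight(time, status)
ph_res = PHReg(time, x[:, None], status=status).fit()

print(km.surv_prob[:5])
print(ph_res.params)
```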

multivariate

Multivariate statistical methods for analyzing multiple dependent variables simultaneously.

Models:

Factor analysis
MANOVA (Multivariate ANOVA)
Canonical correlation
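For instance, a MANOVA with two response variables can be sketched as follows. The DataFrame and its column names (group, y1, y2) are invented for illustration.

```python
import numpy as np
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

rng = np.random.default_rng(8)
n = 90
group = np.repeat(["a", "b", "c"], n // 3)
shift = pd.Series(group).map({"a": 0.0, "b": 1.0, "c": 2.0}).to_numpy()

df = pd.DataFrame({
    "group": group,
    "y1": shift + rng.normal(size=n),
    "y2": 0.5 * shift + rng.normal(size=n),
})

# Both responses go on the left-hand side of the formula
manova = MANOVA.from_formula("y1 + y2 ~ group", data=df)
mv = manova.mv_test()

print(mv.summary())
```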

Supporting modules

Several modules provide infrastructure and utilities that support model fitting and analysis.

tools

Helper functions for data manipulation, matrix operations, and model utilities.

Common functions:

add_constant: Adds intercept column to design matrix
categorical: Creates dummy variables from categorical data (deprecated and dropped in recent releases, where pandas.get_dummies is the usual replacement)
Data validation and transformation utilities

import statsmodels.api as sm
X_with_const = sm.add_constant(X)

You’ll use add_constant frequently with the array interface, since models such as OLS need an intercept term but don’t add it automatically. The formula interface adds the intercept for you.

datasets

Built-in datasets for learning, examples, and testing.

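import statsmodels.api as sm
# Fetched from the Rdatasets collection, so this call needs network access
data = sm.datasets.get_rdataset('mtcars', 'datasets')
# Bundled with statsmodels, no download required
spector = sm.datasets.spector.load_pandas()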

The datasets module includes classic econometric and statistical datasets. Each dataset comes with documentation describing its source and variables.

graphics

Plotting functions for statistical graphics and model diagnostics.

Available plots:

Regression plots with confidence bands
Partial regression plots
Influence plots
CCPR plots (component plus residual)
Time series plots

from statsmodels.graphics.regressionplots import plot_leverage_resid2
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

These plotting functions integrate with Matplotlib to create publication-quality statistical graphics.

iolib

Input/output tools, mainly for formatting and exporting results.

Table formatting for results
Summary table creation
Export of summaries to text, LaTeX, or HTML

Useful when you need to export formatted tables. Older releases also shipped a reader for Stata .dta files; with current versions, pandas.read_stata is the usual way to load Stata data.

Formula API vs array API

Statsmodels provides two ways to specify models, each suited to different workflows.

Array API

Uses NumPy arrays or pandas DataFrames directly. You construct the design matrix yourself, including the intercept column.

import statsmodels.api as sm
import numpy as np

X = np.random.rand(100, 3)
y = np.random.rand(100)
X = sm.add_constant(X)  # Add intercept

model = sm.OLS(y, X).fit()

The array API gives you complete control over the design matrix. Use this when you need custom transformations or when you’re building programmatic workflows.

Formula API

Uses R-style formulas to specify models. The library handles dummy variable creation, interactions, and transformations automatically.

import statsmodels.formula.api as smf
import numpy as np
import pandas as pd

# Data must be in a DataFrame
data = pd.DataFrame({
    'y': np.random.rand(100),
    'x1': np.random.rand(100),
    'x2': np.random.rand(100),
    'category': np.random.choice(['A', 'B', 'C'], 100)
})

model = smf.ols('y ~ x1 + x2 + C(category)', data=data).fit()

The formula interface is cleaner for exploratory analysis and when working with DataFrames. Categorical variables get automatic dummy coding. Transformations like np.log() work directly in formulas.

Import strategies

Different import patterns work better for different situations.

For interactive analysis:

import statsmodels.api as sm
import statsmodels.formula.api as smf

This gives you access to commonly used models with minimal typing.

For production code:

from statsmodels.regression.linear_model import OLS
from statsmodels.stats.diagnostic import het_breuschpagan

Direct imports make dependencies explicit and avoid loading unnecessary modules.

For subpackage-specific work:

import statsmodels.tsa.api as tsa
model = tsa.SARIMAX(data, order=(1,1,1))

Time series, statistics, and graphics subpackages have their own API modules for focused work.

The sandbox

Statsmodels includes a sandbox directory containing experimental code that isn’t considered production-ready. Features in the sandbox might have incomplete testing, limited documentation, or unstable APIs.

Sandbox contents:

Experimental regression methods
Developmental time series models
Prototype statistical tests
Research-stage implementations

Avoid using sandbox code in production unless you’re willing to maintain it yourself if the API changes. Some features migrate from the sandbox to the main subpackages once they’re stable, but there is no guarantee that any given piece will.

Navigating the documentation

The official Statsmodels documentation organizes material differently than the code structure.

Documentation structure:

User Guide: Topic-based tutorials and explanations
API Reference: Complete class and function listings
Examples: Jupyter notebooks demonstrating workflows

Finding what you need:

If you know the model type (linear regression, logistic regression, ARIMA), check the User Guide for the relevant section. It explains when to use each model and shows worked examples.

For specific function signatures and parameters, use the API Reference. It lists every public class and function with complete parameter documentation.

Practical examples of structure usage

Linear regression with diagnostics:

import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_white

# Fit model
model = sm.OLS(y, X).fit()

# Run diagnostics from stats subpackage
white_test = het_white(model.resid, model.model.exog)
print(f"White test p-value: {white_test[1]}")

Time series with custom imports:

from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import adfuller

# Check stationarity
adf_result = adfuller(timeseries)

# Fit ARIMA
model = ARIMA(timeseries, order=(1,1,1)).fit()

Formula-based GLM with custom family:

import statsmodels.formula.api as smf
import statsmodels.api as sm

model = smf.glm(
    'count ~ treatment + age',
    data=df,
    family=sm.families.NegativeBinomial()
).fit()

Understanding module organization benefits

The topic-based structure prevents namespace pollution and enables efficient imports. You only load what you actually need for your analysis.

Subpackages make the codebase maintainable. Developers working on time series don’t affect regression code. New models get added to appropriate subpackages without disrupting existing functionality.

For users, the organization mirrors how statisticians think about methods. If you need discrete choice modeling, everything relevant lives in the discrete subpackage. Time series tools cluster together in tsa.

The dual API design balances convenience and explicitness. Import statsmodels.api for quick interactive work. Use direct imports for production code where you want crystal-clear dependencies.

This structure reflects a mature library designed for serious statistical work rather than a monolithic catch-all package.
