Statsmodels organizes its functionality into topic-based subpackages rather than dumping everything into a single namespace. Understanding this structure helps you find the right models quickly and import them efficiently.
The library provides two primary access points: statsmodels.api for general use and statsmodels.formula.api for R-style formula syntax. Beyond these, specialized subpackages contain models, tools, and functions organized by statistical domain.
How the API structure works
statsmodels.api is a convenience namespace rather than the library itself. It collects the most commonly used classes and functions from the various subpackages and re-exports them through a single, flat interface.
Standard import convention:
import statsmodels.api as sm
import statsmodels.formula.api as smf
These imports give you access to regression models, GLMs, time series tools, and statistical tests without navigating the full directory structure. The API makes the most useful items available within one or two attribute levels.
What the API includes:
- Core regression models (OLS, WLS, GLS, GLSAR)
- Generalized linear models and families
- Discrete choice models (Logit, Probit, Poisson)
- Robust regression (RLM)
- Mixed effects models (MixedLM)
- Time series analysis classes
- Statistical tests and diagnostic tools
- Datasets for examples and testing
The API doesn’t include every function in the library. Specialized features often require direct imports from their subpackages.
Core subpackages and their purposes
Statsmodels groups related functionality into subpackages that match statistical modeling categories. Each subpackage focuses on a specific type of analysis.
regression
Contains linear regression models and related tools.
Main classes:
- OLS: Ordinary Least Squares regression
- WLS: Weighted Least Squares
- GLS: Generalized Least Squares
- GLSAR: GLS with autoregressive errors
- QuantReg: Quantile regression
- RecursiveLS: Recursive least squares
- MixedLM: Mixed effects linear models
Common usage:
from statsmodels.regression.linear_model import OLS
# or through the API
import statsmodels.api as sm
model = sm.OLS(y, X)
The regression subpackage provides the foundation for most statistical modeling work. These models handle continuous response variables with various assumptions about error structure.
genmod (Generalized Models)
Handles generalized linear models where the response variable follows distributions other than normal.
Main classes:
- GLM: Generalized Linear Model
- GEE: Generalized Estimating Equations
- families: Distribution families (Gaussian, Binomial, Poisson, etc.)
- cov_struct: Covariance structures for GEE
Distribution families available:
- Binomial for binary and proportion data
- Poisson for count data
- Gamma for positive continuous data
- Inverse Gaussian for skewed positive data
- Negative Binomial for overdispersed counts
import statsmodels.api as sm
model = sm.GLM(y, X, family=sm.families.Binomial())
GLMs extend linear regression to handle different response distributions while maintaining the linear predictor framework.
discrete
Focuses on models where the dependent variable is categorical or count-based.
Main classes:
- Logit: Binary logistic regression
- Probit: Binary probit regression
- MNLogit: Multinomial logit for unordered categories
- Poisson: Poisson regression for count data
- NegativeBinomial: For overdispersed count data
- ZeroInflatedPoisson: Handles excess zeros in count data
- ZeroInflatedNegativeBinomial: Combines zero inflation with overdispersion
from statsmodels.discrete.discrete_model import Logit
# or
import statsmodels.api as sm
model = sm.Logit(y, X)
Discrete models handle situations where your outcome isn’t continuous. Use these for classification, count analysis, or choice modeling.
tsa (Time Series Analysis)
Comprehensive time series modeling toolkit with univariate and multivariate methods.
Main classes and modules:
- AutoReg: Autoregressive models (replaces the older AR class, which recent releases no longer provide)
- ARIMA: Autoregressive Integrated Moving Average
- SARIMAX: Seasonal ARIMA with exogenous variables
- VAR: Vector Autoregression for multiple time series
- stattools: ACF, PACF, stationarity tests
- filters: Time series filtering methods
- seasonal: Seasonal decomposition (seasonal_decompose and STL)
from statsmodels.tsa.arima.model import ARIMA
# or
import statsmodels.tsa.api as tsa
Time series analysis in Statsmodels covers forecasting, trend analysis, seasonal patterns, and multivariate dynamics. The stattools module provides diagnostic functions for checking stationarity and identifying appropriate model orders.
robust
Implements regression methods that handle outliers and violations of normality assumptions.
Main class:
- RLM: Robust Linear Model using M-estimators
from statsmodels.robust.robust_linear_model import RLM
# or
import statsmodels.api as sm
model = sm.RLM(y, X)
Robust regression automatically downweights observations with large residuals, preventing outliers from dominating your parameter estimates.
stats
Statistical tests, diagnostics, and hypothesis testing tools.
Submodules:
- diagnostic: Heteroscedasticity tests, autocorrelation tests, normality tests
- outliers_influence: Leverage, Cook’s distance, influence measures
- multitest: Multiple testing corrections
- proportion: Tests for proportions
- weightstats: Weighted statistics and tests
- anova: ANOVA tables and methods
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.outliers_influence import variance_inflation_factor
The stats subpackage provides the diagnostic tools you need to validate model assumptions and detect problems in your analysis.
nonparametric
Non-parametric methods that don’t assume specific functional forms or distributions.
Functionality:
- Kernel density estimation
- Kernel regression (local polynomial fitting)
- LOWESS smoothing (spline smoothers live in the separate gam subpackage)
- Non-parametric hypothesis tests
from statsmodels.nonparametric.kde import KDEUnivariate
from statsmodels.nonparametric.kernel_regression import KernelReg
Use these when your relationships aren’t linear or you want to explore data structure without imposing parametric assumptions.
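As a minimal sketch of exploring data without a parametric form, the example below estimates a density with KDEUnivariate and smooths a noisy curve with lowess from the same subpackage. The sine-shaped signal, the seed, and frac=0.3 are arbitrary illustration values.

```python
import numpy as np
from statsmodels.nonparametric.kde import KDEUnivariate
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(6)

# Kernel density estimate of a standard normal sample
sample = rng.normal(size=400)
kde = KDEUnivariate(sample)
kde.fit()  # default Gaussian kernel with a rule-of-thumb bandwidth
density_at_zero = kde.evaluate(np.array([0.0]))

# LOWESS smooth of a noisy nonlinear relationship
x = np.sort(rng.uniform(0, 2 * np.pi, size=200))
y = np.sin(x) + rng.normal(scale=0.3, size=200)
smoothed = lowess(y, x, frac=0.3)  # column 0: sorted x, column 1: fitted values

print("estimated density at 0:", density_at_zero)
print("first rows of the smooth:", smoothed[:3])
```

Note the argument order of lowess: the response comes first, then the predictor.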
duration
Survival analysis and duration models.
Classes for:
- Proportional hazards models
- Survival functions
- Hazard estimation
Useful when analyzing time-to-event data like customer churn, equipment failure, or time until a specific outcome.
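For a first look at time-to-event data, a sketch along these lines estimates a survival function from right-censored observations with SurvfuncRight. The exponential event and censoring times are simulated assumptions, standing in for something like customer lifetimes.

```python
import numpy as np
from statsmodels.duration.survfunc import SurvfuncRight

rng = np.random.default_rng(7)
n = 200

# Simulated event times and independent censoring times
event_time = rng.exponential(scale=10.0, size=n)
censor_time = rng.exponential(scale=15.0, size=n)

# We observe whichever comes first, plus a flag for which one it was
time = np.minimum(event_time, censor_time)
status = (event_time <= censor_time).astype(int)  # 1 = event seen, 0 = censored

sf = SurvfuncRight(time, status)

# Estimated survival probabilities at the distinct event times
print(sf.surv_times[:5])
print(sf.surv_prob[:5])

# Estimated median time to event
print("median:", sf.quantile(0.5))
```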
multivariate
Multivariate statistical methods for analyzing multiple dependent variables simultaneously.
Models:
- Factor analysis
- MANOVA (Multivariate ANOVA)
- Canonical correlation
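As one example from this list, the sketch below runs a MANOVA with two dependent variables and one grouping factor. The column names, group shifts, and seed are hypothetical, and the printed table layout may differ between statsmodels versions.

```python
import numpy as np
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

rng = np.random.default_rng(8)
n_per_group = 40

# Three groups whose means differ on both outcomes (assumed for illustration)
groups = np.repeat(["a", "b", "c"], n_per_group)
shift = np.repeat([0.0, 0.5, 1.0], n_per_group)

df = pd.DataFrame({
    "group": groups,
    "y1": shift + rng.normal(size=groups.size),
    "y2": 2.0 * shift + rng.normal(size=groups.size),
})

# Both dependent variables go on the left-hand side of the formula
manova = MANOVA.from_formula("y1 + y2 ~ group", data=df)
mv = manova.mv_test()

# The summary reports the multivariate test statistics for each model term
print(mv.summary())
```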
Supporting modules
Several modules provide infrastructure and utilities that support model fitting and analysis.
tools
Helper functions for data manipulation, matrix operations, and model utilities.
Common functions:
- add_constant: Adds intercept column to design matrix
- categorical: Creates dummy variables from categorical data (deprecated; recent releases point to pandas.get_dummies instead)
- Data validation and transformation utilities
import statsmodels.api as sm
X_with_const = sm.add_constant(X)
You’ll use add_constant frequently: models built through the array interface do not add an intercept automatically, while the formula interface includes one by default.
datasets
Built-in datasets for learning, examples, and testing.
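import statsmodels.api as sm
# Fetched from the Rdatasets collection, so this call needs network access
data = sm.datasets.get_rdataset('mtcars', 'datasets')
# Bundled with statsmodels, so this works offline
spector = sm.datasets.spector.load_pandas()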
The datasets module includes classic econometric and statistical datasets. Each dataset comes with documentation describing its source and variables.
graphics
Plotting functions for statistical graphics and model diagnostics.
Available plots:
- Regression plots with confidence bands
- Partial regression plots
- Influence plots
- CCPR plots (component plus residual)
- Time series plots
from statsmodels.graphics.regressionplots import plot_leverage_resid2
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
These plotting functions integrate with Matplotlib to create publication-quality statistical graphics.
iolib
Input/output tools for reading data from various formats.
- Reading Stata .dta files (older releases only; in current versions, load them with pandas.read_stata)
- Table formatting for results
- Summary table creation
Useful when you’re working with data from statistical software like Stata or need to export formatted tables.
Formula API vs array API
Statsmodels provides two ways to specify models, each suited to different workflows.
Array API
Uses NumPy arrays or Pandas DataFrames directly. You construct your design matrix manually.
import statsmodels.api as sm
import numpy as np
X = np.random.rand(100, 3)
y = np.random.rand(100)
X = sm.add_constant(X) # Add intercept
model = sm.OLS(y, X).fit()
The array API gives you complete control over the design matrix. Use this when you need custom transformations or when you’re building programmatic workflows.
Formula API
Uses R-style formulas to specify models. The library handles dummy variable creation, interactions, and transformations automatically.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
# Data must be in a DataFrame
data = pd.DataFrame({
    'y': np.random.rand(100),
    'x1': np.random.rand(100),
    'x2': np.random.rand(100),
    'category': np.random.choice(['A', 'B', 'C'], 100)
})
model = smf.ols('y ~ x1 + x2 + C(category)', data=data).fit()
The formula interface is cleaner for exploratory analysis and when working with DataFrames. Categorical variables get automatic dummy coding. Transformations like np.log() work directly in formulas.
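To see what the formula machinery builds behind the scenes, this sketch fits a model with a log transform and a categorical term, then lists the resulting parameter names. The column names, seed, and coefficients are invented, and the exact labels of the dummy columns can vary with the formula engine your statsmodels version uses.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(10)
n = 150
df = pd.DataFrame({
    "x1": rng.uniform(1.0, 10.0, size=n),
    "group": rng.choice(["A", "B", "C"], size=n),
})
df["y"] = 2.0 + 1.5 * np.log(df["x1"]) + rng.normal(scale=0.3, size=n)

# np.log is applied inside the formula; C() marks the column as categorical
formula_res = smf.ols("y ~ np.log(x1) + C(group)", data=df).fit()

# One intercept, one dummy per non-reference level, one transformed regressor
param_names = list(formula_res.params.index)
print(param_names)
```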
Import strategies
Different import patterns work better for different situations.
For interactive analysis:
import statsmodels.api as sm
import statsmodels.formula.api as smf
This gives you access to commonly used models with minimal typing.
For production code:
from statsmodels.regression.linear_model import OLS
from statsmodels.stats.diagnostic import het_breuschpagan
Direct imports make dependencies explicit and avoid pulling in the whole statsmodels.api namespace.
For subpackage-specific work:
import statsmodels.tsa.api as tsa
model = tsa.SARIMAX(data, order=(1,1,1))
Time series, statistics, and graphics subpackages have their own API modules for focused work.
The sandbox
Statsmodels includes a sandbox directory containing experimental code that isn’t considered production-ready. Features in the sandbox might have incomplete testing, limited documentation, or unstable APIs.
Sandbox contents:
- Experimental regression methods
- Developmental time series models
- Prototype statistical tests
- Research-stage implementations
Avoid using sandbox code in production unless you’re willing to maintain it yourself if the API changes. Some features migrate from the sandbox to the main subpackages once they’re stable.
Navigating the documentation
The official Statsmodels documentation organizes material differently than the code structure.
Documentation structure:
- User Guide: Topic-based tutorials and explanations
- API Reference: Complete class and function listings
- Examples: Jupyter notebooks demonstrating workflows
Finding what you need:
If you know the model type (linear regression, logistic regression, ARIMA), check the User Guide for the relevant section. It explains when to use each model and shows worked examples.
For specific function signatures and parameters, use the API Reference. It lists every public class and function with complete parameter documentation.
Practical examples of structure usage
Linear regression with diagnostics:
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_white
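# Fit model (y and X are assumed to exist; X must include a constant column)
model = sm.OLS(y, X).fit()
# Run diagnostics from stats subpackage
# het_white returns (LM statistic, LM p-value, F statistic, F p-value)
white_test = het_white(model.resid, model.model.exog)
print(f"White test LM p-value: {white_test[1]}")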
Time series with custom imports:
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import adfuller
# Check stationarity
adf_result = adfuller(timeseries)
# Fit ARIMA
model = ARIMA(timeseries, order=(1,1,1)).fit()
Formula-based GLM with custom family:
import statsmodels.formula.api as smf
import statsmodels.api as sm
model = smf.glm(
    'count ~ treatment + age',
    data=df,
    family=sm.families.NegativeBinomial()
).fit()
Understanding module organization benefits
The topic-based structure keeps namespaces uncluttered and lets you import only the pieces an analysis actually uses.
Subpackages make the codebase maintainable. Developers working on time series don’t affect regression code. New models get added to appropriate subpackages without disrupting existing functionality.
For users, the organization mirrors how statisticians think about methods. If you need discrete choice modeling, everything relevant lives in the discrete subpackage. Time series tools cluster together in tsa.
The dual API design balances convenience and explicitness. Import statsmodels.api for quick interactive work. Use direct imports for production code where you want crystal-clear dependencies.
This structure reflects a mature library designed for serious statistical work rather than a monolithic catch-all package.

