You’ve probably seen data where a simple straight line just doesn’t cut it. Maybe you’re modeling bike rentals and temperature, where the relationship looks more like a mountain than a slope. Or perhaps you’re analyzing medical data where effects taper off at extreme values. This is where Generalized Additive Models come in.
Statsmodels provides GAM functionality that handles penalized estimation of smooth terms in generalized linear models, letting you model complex patterns without losing interpretability. Think of GAMs as the middle ground between rigid linear models and black-box machine learning.
What Problems Do GAMs Actually Solve?
Linear regression assumes your features have a straight-line relationship with your outcome. Real data laughs at this assumption. Between 0 and 25 degrees Celsius, temperature might have a linear effect on bike rentals, but at higher temperatures the effect levels off or even reverses.
GAMs replace each linear term in your regression equation with a smooth function. Instead of forcing a straight line, they fit flexible curves that adapt to your data’s natural shape. The key difference from something like polynomial regression is that GAMs use splines, which are piecewise polynomials that connect smoothly at specific points called knots.
Here’s what makes this useful. You can capture common nonlinear patterns that classic linear models miss, including hockey stick curves where you see sharp changes, or mountain-shaped curves that peak and decline. And unlike random forests or neural networks, you can still explain what your model is doing.
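Before building a full model, it helps to see what a spline basis actually is. Here's a minimal sketch that builds a B-spline basis for a single made-up feature and inspects it; the feature values are invented, and the basis attribute on BSplines is used purely for illustration:
import numpy as np
import pandas as pd
from statsmodels.gam.api import BSplines
# Hypothetical feature: 200 evenly spaced temperature readings
x = pd.DataFrame({'temperature': np.linspace(0, 40, 200)})
# df=7 basis functions, each a cubic (degree 3) piece joined smoothly at knots
bs = BSplines(x, df=[7], degree=[3])
# Assumption: BSplines exposes the stacked design matrix as `basis`;
# rows are observations, columns are the individual basis functions
print(bs.basis.shape)
Each column of that matrix is one building block; the fitted smooth is just a weighted sum of those columns, with the weights estimated during fitting.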
How Do You Build a GAM in Statsmodels?
Let’s work through a concrete example using automobile data. The basic workflow involves creating spline basis functions, then fitting a GLMGam model.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.gam.api import GLMGam, BSplines
# Load your data
df = pd.read_csv('auto_data.csv')
# Select features for smoothing
x_spline = df[['weight', 'horsepower']]
# Create B-spline basis
bs = BSplines(x_spline, df=[12, 10], degree=[3, 3])
# Set penalization weights
alpha = np.array([21833888.8, 6460.38479])
# Fit the model
gam = GLMGam.from_formula(
    'city_mpg ~ fuel + drive',
    data=df,
    smoother=bs,
    alpha=alpha
)
results = gam.fit()
print(results.summary())
The degrees of freedom parameter controls how many parameters you estimate, which determines the wiggliness of your spline. Higher degrees of freedom mean more knots and a more flexible curve. A cubic spline with degree 3 and 4 knots has 7 degrees of freedom.
The relationship goes like this: for a cubic spline without an intercept, df = number of knots + degree. So if you specify df=5 with degree=3, you get 2 knots.
What’s Actually Happening Under the Hood?
A smooth function in GAMs is essentially a weighted sum of basis functions, where basis functions are the building blocks that combine to create your smooth curve. Think of it like constructing a complex shape from simpler pieces.
Each smooth term in your model gets represented as a sum of these basis functions with weights that get estimated during fitting. The penalization parameter (alpha) controls how much you punish wiggliness in the fitted curve.
Here’s a more detailed example showing how to set this up:
from statsmodels.gam.api import GLMGam, BSplines
# Prepare features for smoothing
x_spline = df[['feature1', 'feature2']]
# Create cubic splines with specific degrees of freedom
# df controls flexibility - higher = more wiggly
bs = BSplines(
    x_spline,
    df=[8, 6],      # Different smoothness for each feature
    degree=[3, 3]   # Cubic splines for both
)
# Penalization prevents overfitting
# Higher alpha = smoother curves
alpha = [1000, 500]
# Fit model with both smooth and linear terms: the formula carries the
# linear (e.g. categorical) terms, while feature1 and feature2 enter
# through the smoother
gam = GLMGam.from_formula(
    'response ~ category',
    data=df,
    smoother=bs,
    alpha=alpha
)
results = gam.fit()
The beauty here is you can mix smooth and linear terms. Categorical variables like fuel type or drive configuration stay as regular linear terms, while continuous variables get the smooth treatment.
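After the fit above, you can see the two kinds of terms side by side in the estimated parameters. A small sketch, assuming the "_s" suffix that statsmodels typically uses for spline basis columns (the same naming shows up in the summary output discussed later):
# results.params is a labeled Series for formula-based models
params = results.params
# Spline basis coefficients typically carry an "_s" suffix in their names;
# linear terms look like ordinary regression coefficients
spline_coefs = params[params.index.str.contains('_s')]
linear_coefs = params[~params.index.str.contains('_s')]
print(linear_coefs)
print(spline_coefs)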
How Do You Choose the Right Smoothness?
This is where GAMs get interesting. The smoothing parameter lets you explicitly balance the bias-variance tradeoff, with smoother curves having more bias but less variance. Too smooth and you miss important patterns. Too wiggly and you’re fitting noise.
The penalization parameter (alpha) is your control knob. Set it to zero and you get maximum flexibility but risk overfitting. Crank it up and your curve becomes smoother, potentially more believable, but might miss real effects.
Here’s a practical approach to finding good values:
# Try different alpha values
alphas = [0, 10, 100, 1000, 10000]
for alpha_val in alphas:
    gam = GLMGam.from_formula(
        'y ~ x',
        data=df,
        smoother=bs,
        alpha=alpha_val
    )
    results = gam.fit()
    # Compare fits using AIC or cross-validation
    print(f"Alpha: {alpha_val}, AIC: {results.aic}")
Look at your fitted curves. Does the relationship make sense? Could you explain it to someone? If your curve has bizarre wiggles that don’t match domain knowledge, increase the penalization.
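One way to eyeball this is to fit the same model with a small and a large alpha and compare the partial-effect plots side by side. A rough sketch, reusing the smoother bs and the schematic formula from above:
import matplotlib.pyplot as plt
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
for ax, alpha_val in zip(axes, [10, 10000]):
    res = GLMGam.from_formula(
        'y ~ x', data=df, smoother=bs, alpha=alpha_val
    ).fit()
    # Partial effect of the first smooth term under this penalization
    res.plot_partial(0, ax=ax)
    ax.set_title(f'alpha = {alpha_val}')
plt.tight_layout()
plt.show()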
Can You Use Different Types of Splines?
Statsmodels supports different spline types for different scenarios. Beyond B-splines, you can use cyclic cubic regression splines for data with seasonal patterns.
from statsmodels.gam.api import CyclicCubicSplines
# For time series with seasonal effects
x_cyclic = df[['day_of_year']]
cs = CyclicCubicSplines(x_cyclic, df=[10])
# The formula keeps the linear trend term; day_of_year enters via the cyclic smoother
gam = GLMGam.from_formula(
    'sales ~ trend',
    data=df,
    smoother=cs,
    alpha=100
)
results = gam.fit()
Cyclic splines ensure the curve connects smoothly at the boundaries, which makes sense when modeling phenomena that repeat like daily or yearly patterns.
How Do You Interpret GAM Results?
The output shows coefficients for your linear terms and spline basis functions. Here’s what to look for:
results = gam.fit()
print(results.summary())
# Linear terms show up like regular regression coefficients
# Spline terms show up as feature_s0, feature_s1, etc.
# Each represents a basis function coefficient
# Fitted values for the training data
predictions = results.fittedvalues
# (for new data you pass exog and exog_smooth explicitly; see the final example)
# Plot the partial effect of the first smooth term
results.plot_partial(0)
The spline coefficients themselves aren’t directly interpretable, but plotting the partial effects shows you the shape of the relationship. This is what you’d show stakeholders or include in a report.
For a feature like weight, you might see the curve drops sharply at first, then levels off. That tells you weight has a strong negative effect on fuel economy for lighter cars, but the effect diminishes for heavier vehicles.
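To judge whether a shape like that is genuinely supported by the data rather than being an artifact of the smoothing, plot_partial can overlay partial residuals via its cpr (component plus residual) option:
# Partial effect of the first smooth term with the component-plus-residual
# scatter overlaid on the fitted curve
results.plot_partial(0, cpr=True)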
What Are Common Gotchas?
First, make sure your basis dimension isn’t too restrictive. If the effective degrees of freedom are close to the basis dimension you specified, your splines might be constrained. Try increasing the df parameter.
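A quick, informal check is to refit with a more generous basis and see whether the fit improves meaningfully. A sketch reusing x_spline, alpha, and the original results from the auto example (ideally you would retune alpha for the wider basis before comparing):
# Refit the auto example with a larger basis for each smooth term
bs_wide = BSplines(x_spline, df=[16, 12], degree=[3, 3])
results_wide = GLMGam.from_formula(
    'city_mpg ~ fuel + drive', data=df, smoother=bs_wide, alpha=alpha
).fit()
# If AIC barely improves, the smaller basis was probably not the bottleneck
print(results.aic, results_wide.aic)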
Second, watch out for extrapolation. GAMs can behave strangely outside the range of your training data. Always check predictions against the ranges you fitted on.
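One defensive pattern is to flag, or clip, incoming feature values that fall outside the range the model was trained on before predicting. A minimal sketch, where new_cars is a hypothetical frame of new observations with the same smoothed columns as above:
# Training ranges of the smoothed features
train_min = x_spline.min()
train_max = x_spline.max()
# Flag rows of the new data that fall outside the fitted range
new_x = new_cars[['weight', 'horsepower']]
outside = ((new_x < train_min) | (new_x > train_max)).any(axis=1)
print(f"{outside.sum()} rows fall outside the training range")
# Optionally clip so the splines are only evaluated where they were fit
clipped = new_x.copy()
for col in clipped.columns:
    clipped[col] = clipped[col].clip(train_min[col], train_max[col])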
Third, penalization weights need tuning. The values in the statsmodels examples aren’t magic numbers. Use cross-validation or information criteria to find values that work for your data.
# Score candidate alphas by AIC on the training data; a full
# cross-validation would need a scikit-learn-style wrapper
def score_alpha(alpha_val):
    gam = GLMGam.from_formula(
        'y ~ x',
        data=train_df,
        smoother=bs,
        alpha=alpha_val
    )
    results = gam.fit()
    return results.aic
# Find the best alpha among the candidate values tried earlier
best_alpha = min(alphas, key=score_alpha)
When Should You Use GAMs Instead of Alternatives?
Use GAMs when you need to model nonlinear relationships but still want interpretability. On the continuum between interpretable models and raw predictive power, they often come close to the accuracy of more complex models while staying explainable.
Choose random forests or gradient boosting when you have complex interactions between many features and don’t need to explain the exact functional form. But if you’re presenting to non-technical stakeholders or need to justify your model’s behavior, GAMs let you show actual curves and say “here’s exactly how this feature affects the outcome.”
For medical research, insurance pricing, or any regulated industry where you need to explain your predictions, GAMs are often the right tool. You get the flexibility to model real patterns without sacrificing transparency.
Putting It Together
Here’s a complete workflow you can adapt:
import pandas as pd
import statsmodels.api as sm
from statsmodels.gam.api import GLMGam, BSplines
import matplotlib.pyplot as plt
# Load and prepare data
df = pd.read_csv('your_data.csv')
# Select smooth vs linear features
smooth_features = df[['temperature', 'humidity']]
bs = BSplines(smooth_features, df=[10, 8], degree=[3, 3])
# Fit model mixing smooth and linear terms: the formula lists only the
# linear terms -- temperature and humidity enter through the smoother
gam = GLMGam.from_formula(
    'outcome ~ category + region',
    data=df,
    smoother=bs,
    alpha=[1000, 800]
)
results = gam.fit()
# Examine fit
print(results.summary())
print(f"AIC: {results.aic}")
# Visualize relationships
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
results.plot_partial(0, ax=axes[0])
results.plot_partial(1, ax=axes[1])
plt.tight_layout()
plt.show()
# Make predictions
new_data = pd.DataFrame({
    'temperature': [20, 25, 30],
    'humidity': [50, 60, 70],
    'category': ['A', 'B', 'A'],
    'region': ['North', 'South', 'North']
})
# Pass the linear-term data as exog and the smoothed columns as exog_smooth
predictions = results.predict(
    new_data, exog_smooth=new_data[['temperature', 'humidity']]
)
GAMs give you a powerful way to handle real-world data that doesn’t fit neat linear assumptions. The statsmodels implementation provides the tools you need, though you’ll need to invest some time understanding how to set degrees of freedom and penalization weights for your specific problem.
Further Reading:
- Statsmodels GAM documentation at https://www.statsmodels.org/stable/gam.html
- Introduction to GAMs with practical examples at https://kirenz.github.io/regression/docs/gam.html

