I’ve built dozens of regression models over the years, and here’s what I’ve learned: the math behind linear regression is straightforward, but getting it right requires understanding what’s happening under the hood. That’s where statsmodels shines. Unlike scikit-learn, which optimizes for prediction, statsmodels gives you the statistical framework to understand relationships in your data.

Let’s work through linear regression in Python using statsmodels, from basic implementation to diagnostics that actually matter.

What is statsmodels and why use it for regression?

Statsmodels is a Python library that provides tools for estimating statistical models, including ordinary least squares (OLS), weighted least squares (WLS), and generalized least squares (GLS). Think of it as the statistical counterpart to scikit-learn. Where scikit-learn focuses on prediction accuracy, statsmodels focuses on inference: understanding which variables matter, quantifying uncertainty, and validating assumptions.

The library gives you detailed statistical output including p-values, confidence intervals, and diagnostic tests. This matters when you’re not just predicting house prices but explaining to stakeholders why square footage matters more than the number of bathrooms.

How do you implement a basic linear regression model?

Start with the simplest case: one predictor variable. Here’s a complete example using car data to predict fuel efficiency:

import statsmodels.api as sm
import pandas as pd
import numpy as np

# Load data
mtcars = sm.datasets.get_rdataset("mtcars").data

# Define variables
X = mtcars[["wt"]]  # Car weight
y = mtcars["mpg"]   # Miles per gallon

# Add constant term for intercept
X = sm.add_constant(X)

# Fit the model
model = sm.OLS(y, X)
results = model.fit()

# Display results
print(results.summary())

The summary output shows the coefficient on weight, which tells you how much mpg changes with each additional unit of weight. You’ll see a coefficient of about -5.34, meaning each additional 1,000 pounds of weight reduces fuel efficiency by roughly 5.34 mpg.
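
If you want these numbers programmatically rather than reading them off the table, the fitted results object exposes them directly. A quick sketch using the standard statsmodels results attributes:

# Coefficient on weight and its 95% confidence interval
print(results.params["wt"])          # slope estimate, about -5.34
print(results.conf_int().loc["wt"])  # lower and upper bounds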

The constant term is necessary. Adding a column of ones using sm.add_constant() creates the intercept parameter, allowing the model to learn both slope and y-intercept. Without it, you’re forcing the regression line through the origin, which rarely makes sense.
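
To see what dropping the constant does, here’s a quick sketch that refits the same data without it; the model then estimates only a slope, forcing the fitted line through the origin:

# Fit without an intercept: only the slope on weight is estimated
model_no_intercept = sm.OLS(y, mtcars[["wt"]])
results_no_intercept = model_no_intercept.fit()
print(results_no_intercept.params)  # a single coefficient, no 'const' term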

How do you interpret the model summary output?

The summary table contains more information than you’ll use regularly, but certain metrics deserve attention:

R-squared: Measures how much variance in your dependent variable the model explains. An R-squared of 0.75 means your predictors account for 75% of the variation. Higher is generally better, but a high R-squared alone doesn’t make a model trustworthy; it says nothing about whether the assumptions hold or whether you’re overfitting.

P-values: Tell you whether each coefficient is statistically significant. A p-value below 0.05 suggests the observed relationship would be unlikely if the true coefficient were zero. High p-values mean you should question whether that variable belongs in your model.

Coefficients: The actual effect size. A coefficient of -0.49 for literacy in a lottery participation model means each one-unit increase in literacy corresponds to a 0.49-unit decrease in lottery participation, holding other variables constant.

Standard errors: Quantify uncertainty around each coefficient. Large standard errors relative to the coefficient value indicate unstable estimates.
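
All of these are available as attributes on the fitted results object, which is handy when you want to log or compare models instead of reading the summary table. A quick sketch using the standard results attributes:

# Key metrics as plain numbers and pandas Series
print(results.rsquared)  # R-squared
print(results.bse)       # standard errors for each coefficient
print(results.pvalues)   # p-values for each coefficient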

What about multiple regression with several predictors?

Real data rarely involves single predictors. Here’s how you handle multiple independent variables:

import statsmodels.formula.api as smf

# Using formula notation (R-style); housing_data is a pandas DataFrame
# containing the columns named in the formula
formula = 'housing_price_index ~ unemployment + gdp + interest_rate'
model = smf.ols(formula, data=housing_data)
results = model.fit()

print(results.summary())

The formula notation is cleaner for complex models. The left side of the ~ operator contains the dependent variable and the right side contains the independent variables. The formula interface also adds the intercept automatically, so there’s no need for sm.add_constant() here.

When you add multiple predictors, watch for multicollinearity. If predictors are highly correlated, try removing one or more since they supply redundant information. For example, including both “total debt” and “monthly debt payments” in a credit model might create problems because they measure essentially the same thing.
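
One way to quantify multicollinearity is the variance inflation factor (VIF), which statsmodels exposes as variance_inflation_factor. Here’s a rough sketch using a few mtcars columns as an example of correlated predictors:

from statsmodels.stats.outliers_influence import variance_inflation_factor

# Design matrix with several correlated predictors from mtcars
X_multi = sm.add_constant(mtcars[["wt", "disp", "hp"]])

# VIF for each column; values well above roughly 5-10 are a common warning sign
vif = pd.Series(
    [variance_inflation_factor(X_multi.values, i) for i in range(X_multi.shape[1])],
    index=X_multi.columns,
)
print(vif)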

How do you check if your model assumptions hold?

Linear regression rests on specific assumptions. Violate them and your results become unreliable. Here’s what to check:

Linearity: Plot residuals against fitted values. You want a random scatter with no pattern. If you see a curve, the relationship isn’t linear and you might need polynomial terms.

import matplotlib.pyplot as plt

# Get residuals and fitted values
residuals = results.resid
fitted_values = results.fittedvalues

# Plot residuals vs fitted values
plt.scatter(fitted_values, residuals)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Fitted Values')
plt.ylabel('Residuals')
plt.title('Residuals vs Fitted Values')
plt.show()

If residuals show no obvious pattern and appear as a random scatter, the linearity assumption holds.

Normality of residuals: Use a Q-Q plot to check if residuals follow a normal distribution. Points should fall along the diagonal line.

import scipy.stats as stats

# Create Q-Q plot
fig, ax = plt.subplots()
stats.probplot(residuals, plot=ax)
plt.show()

Homoscedasticity: Residuals should have constant variance across all levels of fitted values. If the scatter plot shows a funnel shape (wider on one end), you have heteroscedasticity.

You can test this statistically using the Breusch-Pagan test:

import statsmodels.stats.diagnostic as diag

# Breusch-Pagan test for heteroscedasticity
bp_test = diag.het_breuschpagan(residuals, results.model.exog)
labels = ['LM Statistic', 'LM-Test p-value', 'F-Statistic', 'F-Test p-value']
print(dict(zip(labels, bp_test)))

A p-value above 0.05 means the test finds no evidence of heteroscedasticity, so the homoscedasticity assumption is reasonable.

How do you identify influential observations and outliers?

Some data points exert disproportionate influence on your regression line. Outlier measures help identify observations with large residuals or large influence on regression estimates.

Cook’s distance quantifies influence. A common rule of thumb flags values above 1 as problematic, though some analysts use the stricter cutoff of 4/n:

from statsmodels.stats.outliers_influence import OLSInfluence

# Calculate influence measures
influence = OLSInfluence(results)
cooks_d = influence.cooks_distance[0]

# Flag influential points
threshold = 1
influential_points = np.where(cooks_d > threshold)[0]
print(f"Influential observations: {influential_points}")

You can also examine leverage (how far a point is from other observations in predictor space) and studentized residuals (standardized residuals accounting for their variance).
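
Both come from the same OLSInfluence object. A short sketch, flagging points with common rules of thumb (the cutoffs below are conventions, not hard limits):

# Leverage (hat values) and externally studentized residuals
leverage = influence.hat_matrix_diag
student_resid = influence.resid_studentized_external

# Rules of thumb: leverage > 2 * p / n (p = number of parameters), |studentized residual| > 2
n, p = X.shape
high_leverage = np.where(leverage > 2 * p / n)[0]
large_resid = np.where(np.abs(student_resid) > 2)[0]
print(f"High-leverage points: {high_leverage}")
print(f"Large studentized residuals: {large_resid}")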

What do you do when assumptions fail?

When diagnostic plots reveal problems, you have options:

For non-linearity: Add polynomial terms. If the relationship between age and income shows diminishing returns, include both age and age-squared:

data['age_squared'] = data['age'] ** 2
formula = 'income ~ age + age_squared + education'
results_poly = smf.ols(formula, data=data).fit()

For heteroscedasticity: Use weighted least squares (WLS) instead of OLS, giving less weight to observations with higher variance:

# Rough two-step approach: treat the squared OLS residuals as an estimate of
# each observation's variance and weight inversely to it
weights = 1 / (residuals ** 2)
model_wls = sm.WLS(y, X, weights=weights)
results_wls = model_wls.fit()

For outliers: Don’t automatically delete them. Investigate why they exist. They might represent data entry errors, or they might be the most interesting observations in your dataset.

How do you make predictions with your model?

Once you’ve validated your model, use it for predictions:

# Create new data for prediction, adding the constant the same way as for
# the training design matrix so the column order matches
new_data = pd.DataFrame({'wt': [3.0, 3.5, 4.0]})
new_data = sm.add_constant(new_data)

# Make predictions
predictions = results.predict(new_data)
print(predictions)

# Get confidence intervals
predictions_summary = results.get_prediction(new_data)
confidence_intervals = predictions_summary.summary_frame()
print(confidence_intervals)

The confidence intervals tell you the range where you expect the true mean response to fall. Wider intervals indicate more uncertainty.
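
Note that summary_frame() returns two kinds of intervals: the mean_ci_ columns bound the average response at those predictor values, while the obs_ci_ columns give the wider prediction interval for an individual new observation (column names as produced by statsmodels for OLS prediction results):

# Confidence interval for the mean response vs. prediction interval for a single new car
print(confidence_intervals[["mean_ci_lower", "mean_ci_upper"]])
print(confidence_intervals[["obs_ci_lower", "obs_ci_upper"]])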

Conclusion

Statsmodels gives you the statistical machinery to understand your data properly. Unlike black-box prediction libraries, it forces you to think about assumptions, relationships, and uncertainty. Start with simple models, validate assumptions through diagnostics, and only add complexity when justified. The summary statistics aren’t decorative; they tell you whether your model makes sense.

For detailed statsmodels documentation, see the official statsmodels regression guide. For deeper statistical background, Montgomery and Peck’s “Introduction to Linear Regression Analysis” remains the standard reference.
