You’re running a regression on your sales data, and a few extreme values are throwing off your predictions. Maybe it’s a single huge order, or data entry errors, or legitimate edge cases you can’t just delete. Standard linear regression treats every point equally, which means those outliers pull your coefficients in the wrong direction. Robust Linear Models in statsmodels give you a better option.
What makes robust regression different from regular OLS?
Ordinary least squares regression gives outliers disproportionate influence because errors are squared. An outlier with twice the typical error contributes four times as much to the loss function. Robust Linear Models use iteratively reweighted least squares with M-estimators that downweight outliers instead of amplifying their impact.
Think of it this way: OLS assumes all your data points are equally trustworthy. RLM asks “how much should I trust each observation?” and adjusts accordingly. Points that look like outliers get lower weights, so they influence the final model less.
The math behind this involves M-estimators, which minimize a function of residuals that grows more slowly than squared errors. Peter Huber introduced M-estimation in 1964, and it remains the foundation for most robust regression methods today.
How do I fit a basic robust regression model?
Here’s a simple example using statsmodels:
import statsmodels.api as sm
import numpy as np
# Generate data with outliers
np.random.seed(42)
X = np.linspace(0, 10, 50)
y = 2 * X + 1 + np.random.normal(0, 1, 50)
# Add some outliers
y[5] = 30
y[15] = -10
y[25] = 35
# Add constant term for intercept
X_with_const = sm.add_constant(X)
# Fit robust linear model
rlm_model = sm.RLM(y, X_with_const, M=sm.robust.norms.HuberT())
rlm_results = rlm_model.fit()
# Compare with OLS
ols_model = sm.OLS(y, X_with_const)
ols_results = ols_model.fit()
print("RLM coefficients:", rlm_results.params)
print("OLS coefficients:", ols_results.params)
The RLM class requires you to manually add the intercept using add_constant. The M parameter specifies which robust criterion to use. HuberT is the default and works well for most cases.
The results show you immediately how much those outliers were pulling your OLS estimates around. RLM gives you coefficients closer to the true relationship.
Which M-estimator should I choose?
Statsmodels offers several M-estimators: HuberT, Hampel, AndrewWave, RamsayE, TrimmedMean, and TukeyBiweight. Each handles outliers differently.
Huber: The default choice. It uses squared loss for small residuals and absolute loss for large ones. This gives you a balance between efficiency and robustness.
huber_model = sm.RLM(y, X_with_const, M=sm.robust.norms.HuberT())
huber_results = huber_model.fit()
Hampel: More aggressive at downweighting outliers. Good when you have many extreme values.
hampel_model = sm.RLM(y, X_with_const, M=sm.robust.norms.Hampel())
hampel_results = hampel_model.fit()
AndrewWave: Completely eliminates the influence of severe outliers by setting their weight to zero beyond a threshold.
andrew_model = sm.RLM(y, X_with_const, M=sm.robust.norms.AndrewWave())
andrew_results = andrew_model.fit()
The choice depends on your data. If you have occasional outliers, Huber works fine. If outliers are common or extreme, try Hampel or AndrewWave.
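If you're unsure, one practical check is to fit the same data with several norms and see whether the coefficients change much. Here's a quick sketch, reusing y and X_with_const from the first example:
# Fit the same data with several norms and compare coefficients
norms = {
    'HuberT': sm.robust.norms.HuberT(),
    'Hampel': sm.robust.norms.Hampel(),
    'AndrewWave': sm.robust.norms.AndrewWave(),
    'TukeyBiweight': sm.robust.norms.TukeyBiweight(),
}
for name, norm in norms.items():
    fit = sm.RLM(y, X_with_const, M=norm).fit()
    print(f"{name:>13}: intercept={fit.params[0]:.3f}, slope={fit.params[1]:.3f}")
If the estimates barely move, the choice of norm isn't critical for your data. If they diverge, look at the weights to see which observations each norm is discounting.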
How does the Huber loss function work?
The Huber loss is quadratic for small residuals and linear for large ones, with a tuning parameter delta that determines the threshold. Below the threshold, it behaves like squared error. Above it, it grows linearly instead of quadratically.
Here’s what that means in practice: For most of your data points with small errors, Huber loss acts like regular OLS, which is statistically efficient. For the few points with large errors, it switches to absolute loss, preventing them from dominating the fit.
The formula looks like this:
For |a| ≤ δ: Loss = (1/2) * a²
For |a| > δ: Loss = δ * (|a| - δ/2)
Where ‘a’ is the residual and δ is the tuning constant. This combination preserves sensitivity to normal observations while maintaining robustness against outliers.
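You can verify this behavior yourself: the norm objects in statsmodels.robust.norms expose the loss through their rho method, and HuberT's default tuning constant is 1.345. A small sketch, reusing the earlier imports:
huber = sm.robust.norms.HuberT()  # default tuning constant t = 1.345
residuals = np.array([0.5, 1.0, 1.345, 3.0, 10.0])
print(huber.rho(residuals))   # quadratic below the threshold, linear above it
print(0.5 * residuals**2)     # pure squared loss, for comparison
For the small residuals the two rows match; for the large ones the Huber values grow far more slowly than the squared losses.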
What about scale estimation?
RLM uses scale estimation to determine how to weight residuals, with ‘mad’ (median absolute deviation) as the default. You can also use HuberScale for a different approach:
model = sm.RLM(y, X_with_const, M=sm.robust.norms.Hampel())
results = model.fit(scale_est=sm.robust.scale.HuberScale())
MAD is simpler and works well for most cases. HuberScale (Huber's proposal 2) estimates the scale iteratively alongside the coefficients, which can be more efficient but takes longer to compute. Unless you're dealing with particularly tricky data, stick with the default.
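To see why MAD is a sensible default, you can compute it directly (it's exposed as sm.robust.scale.mad) and compare it with the ordinary standard deviation of the OLS residuals from the first example:
resid = ols_results.resid
print("MAD scale:", sm.robust.scale.mad(resid))  # barely affected by the injected outliers
print("Std dev:  ", resid.std())                 # inflated by the same outliers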
How do I examine the weights assigned to each observation?
After fitting your model, you can inspect which observations were downweighted:
# Fit the model
rlm_results = sm.RLM(y, X_with_const).fit()
# Get the weights
weights = rlm_results.weights
# Find observations with low weights (likely outliers)
outliers = np.where(weights < 0.5)[0]
print(f"Potential outliers at indices: {outliers}")
print(f"Their weights: {weights[outliers]}")
Weights range from 0 to 1. Values close to 1 mean the observation is treated normally. Values close to 0 indicate the model considers it an outlier. You can use this to identify problematic data points without manually specifying thresholds.
Can I use robust regression for prediction?
Yes, and it works exactly like OLS:
# New data
X_new = np.array([[1, 5], [1, 7], [1, 9]]) # Remember to include constant
# Make predictions
predictions = rlm_results.predict(X_new)
print(predictions)
# Note: robust regression output focuses on coefficient estimates
# rather than prediction intervals
The predictions use your robust coefficient estimates, so they’re less influenced by outliers in your training data. This often gives you more reliable predictions on new data, especially if your training set had contamination.
When should I use robust regression versus removing outliers?
Here’s the practical decision tree:
Use robust regression when:
- You can’t easily identify which points are true outliers versus legitimate extreme values
- You have many observations and don’t want to lose information
- Outliers might contain useful signal you don’t want to completely discard
- You need to automate the process and can’t manually review every dataset
Remove outliers when:
- You can clearly identify data entry errors or measurement problems
- The outliers are definitively invalid (negative prices, impossible dates, etc.)
- You have domain knowledge that certain values are impossible
- Your dataset is small and every observation matters
An advantage of M-estimators is that outlier detection and regression can be solved simultaneously, unlike approaches that require you to identify outliers first.
What about multiple regression with many features?
Robust regression scales to multiple features without issues:
# Multiple features
X1 = np.random.normal(0, 1, 100)
X2 = np.random.normal(5, 2, 100)
X3 = np.random.uniform(0, 10, 100)
X_multi = np.column_stack([X1, X2, X3])
X_multi = sm.add_constant(X_multi)
# True relationship with noise
y_multi = 3 + 2*X1 - 0.5*X2 + 1.2*X3 + np.random.normal(0, 1, 100)
# Add outliers
y_multi[10] = 50
y_multi[30] = -30
# Fit robust model
rlm_multi = sm.RLM(y_multi, X_multi)
results_multi = rlm_multi.fit()
print(results_multi.summary())
The summary output shows you coefficients, standard errors, z-scores, and p-values just like OLS. The interpretation is the same, but your estimates are more stable.
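As a quick sanity check on that claim, fit OLS on the same contaminated data and compare both sets of estimates against the coefficients the data was generated with:
ols_multi = sm.OLS(y_multi, X_multi).fit()
print("True coefficients: [3, 2, -0.5, 1.2]")
print("OLS estimates:", np.round(ols_multi.params, 2))
print("RLM estimates:", np.round(results_multi.params, 2))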
Does robust regression work with categorical variables?
Yes, after encoding them properly:
import pandas as pd
# Create sample data with categories
df = pd.DataFrame({
'sales': np.random.normal(100, 20, 50),
'category': np.random.choice(['A', 'B', 'C'], 50),
'price': np.random.uniform(10, 50, 50)
})
# Add outliers
df.loc[5, 'sales'] = 300
df.loc[15, 'sales'] = 10
# One-hot encode categories
df_encoded = pd.get_dummies(df, columns=['category'], drop_first=True, dtype=float)  # float dummies keep the design matrix numeric for statsmodels
# Set up regression
y = df_encoded['sales']
X = df_encoded[['price', 'category_B', 'category_C']]
X = sm.add_constant(X)
# Fit
rlm_cat = sm.RLM(y, X)
results_cat = rlm_cat.fit()
The categorical variables work exactly as they do in regular regression. The robust estimation just makes the coefficients less sensitive to outliers in the response variable.
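Because X was passed in as a pandas DataFrame, the fitted parameters come back labeled, which makes the baseline interpretation explicit: with drop_first=True, category A is the reference level, and the dummy coefficients measure shifts relative to it.
print(results_cat.params)
# const       -> expected sales for category A at a price of 0
# price       -> change in sales per unit increase in price
# category_B  -> difference between category B and the baseline A
# category_C  -> difference between category C and the baseline A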
How do I compare robust and OLS results side by side?
Here’s a complete example showing the difference:
import matplotlib.pyplot as plt
# Generate clean data
np.random.seed(123)
X_clean = np.linspace(0, 10, 50)
y_clean = 2 + 3*X_clean + np.random.normal(0, 2, 50)
# Add significant outliers
X_clean[[5, 15, 25, 35]] = [2, 4, 6, 8]
y_clean[[5, 15, 25, 35]] = [40, 35, 45, 50]
# Prepare data
X_with_const = sm.add_constant(X_clean)
# Fit both models
ols_model = sm.OLS(y_clean, X_with_const)
ols_results = ols_model.fit()
rlm_model = sm.RLM(y_clean, X_with_const)
rlm_results = rlm_model.fit()
# Plot
plt.figure(figsize=(10, 6))
plt.scatter(X_clean, y_clean, alpha=0.5, label='Data')
# Sort by X so the fitted lines draw cleanly (X is no longer monotonic
# after the outlier positions were reassigned)
order = np.argsort(X_clean)
plt.plot(X_clean[order], ols_results.fittedvalues[order], 'r-',
         label=f'OLS (slope={ols_results.params[1]:.2f})')
plt.plot(X_clean[order], rlm_results.fittedvalues[order], 'g-',
         label=f'RLM (slope={rlm_results.params[1]:.2f})')
plt.legend()
plt.xlabel('X')
plt.ylabel('y')
plt.title('OLS vs Robust Regression')
plt.show()
# Print coefficient comparison
print("\nCoefficient Comparison:")
print(f"OLS Intercept: {ols_results.params[0]:.3f}, Slope: {ols_results.params[1]:.3f}")
print(f"RLM Intercept: {rlm_results.params[0]:.3f}, Slope: {rlm_results.params[1]:.3f}")
The visualization makes it obvious when outliers are pulling your OLS line away from the main pattern. The RLM line stays truer to the bulk of your data.
What are the limitations I should know about?
M-estimation is robust to outliers in the response variable but not resistant to outliers in explanatory variables (leverage points). If you have extreme values in your predictors that are also outliers in the response, robust regression won’t fully solve your problem.
RLM results don’t include R-squared because outliers can make this metric unreliable. Instead, focus on coefficient standard errors, p-values, and confidence intervals to assess model quality.
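All of those are available directly on the results object; for example, for the multiple-regression fit from earlier:
print(results_multi.bse)         # standard errors of the robust coefficient estimates
print(results_multi.pvalues)     # two-sided p-values for each coefficient
print(results_multi.conf_int())  # 95% confidence intervals (the default level)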
The method requires more computation than OLS since it iteratively reweights observations. For massive datasets, this can take noticeably longer, though it’s rarely a practical problem with modern hardware.
Conclusion
Robust regression gives you a practical middle ground between blindly trusting all your data and manually removing outliers. The statsmodels implementation makes it straightforward to fit these models, and the different M-estimators let you tune how aggressively you want to downweight suspicious observations. For most applications, start with HuberT and examine the weights to understand which observations your model considers problematic. If those weights make sense given your domain knowledge, you’ve got a more reliable model than OLS would have given you.