When you first encounter Statsmodels code, you’ll likely run into two terms that seem designed to confuse: endog and exog. Most Python libraries use X and y for their variables. Scikit-learn does it. Pandas tutorials do it. Even introductory statistics courses use these letters. So why does Statsmodels break from convention with these strange terms?

The answer reveals something fundamental about how Statsmodels thinks about statistical modeling, and understanding these terms early will save you from countless head-scratching moments later.

What endog and exog Actually Mean

Let’s start with the definitions. In Statsmodels:

  • endog stands for “endogenous variable”—this is your dependent variable, the outcome you’re trying to predict or explain. In a simple model like predicting house prices, the house price is your endog.
  • exog stands for “exogenous variable(s)”—these are your independent variables, the features you’re using to make predictions. In that same house price example, square footage, number of bedrooms, and neighborhood would be your exog.

The terminology comes from econometrics, where researchers distinguish between variables that are determined within the model (endogenous) and variables that come from outside the model (exogenous). Statsmodels was built by econometricians and statisticians who work in academic research, so they kept the terminology that’s standard in their field.

Why This Matters More Than You Think

The naming convention isn’t just academic pedantry. It reflects a philosophical difference in how Statsmodels approaches modeling compared to machine learning libraries.

Scikit-learn thinks about prediction. You have features (X) that you use to predict targets (y). The framework assumes you care primarily about accuracy and generalization to new data.

Statsmodels thinks about inference. You have variables whose relationships you want to understand, test, and interpret. The framework assumes you care about statistical significance, confidence intervals, and whether your model assumptions hold.

This distinction shows up everywhere in how you use the library. When you fit a Statsmodels model, you’re getting detailed statistical output about coefficients, standard errors, p-values, and diagnostic tests. The endog/exog terminology signals this different mindset from the start.

How to Structure Your Data for endog and exog

The most common question beginners ask: “What format should my data be in?”

For endog (your dependent variable):

Your endog should be a one-dimensional array-like object. This can be a pandas Series, a NumPy array, or even a Python list. The key requirement is that it contains just one column of data, your outcome variable.

import pandas as pd
import statsmodels.api as sm

# Loading some data
data = pd.read_csv('housing_data.csv')

# endog is a single column
endog = data['price']

That’s it. One column, representing the thing you want to predict or explain.

For exog (your independent variables):

Your exog should be a two-dimensional array-like object, even if you only have one independent variable. This is typically a pandas DataFrame or a NumPy array with shape (n_samples, n_features).

# exog can be multiple columns
exog = data[['square_feet', 'bedrooms', 'bathrooms']]

# Or even just one column, but still needs to be 2D
exog = data[['square_feet']]

Notice that even with one feature, we use double brackets [['square_feet']] to keep it two-dimensional. This matters because Statsmodels expects a certain shape, and passing a one-dimensional array when it expects two dimensions will cause errors.
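A quick way to confirm which shape you have is to inspect .shape. This short sketch (with made-up numbers) shows the single-bracket/double-bracket difference, plus .to_frame() as an equivalent way to promote a Series:

```python
import pandas as pd

# Made-up stand-in for the housing data
data = pd.DataFrame({'square_feet': [1200, 1500, 1800],
                     'price': [250000, 300000, 360000]})

# Single brackets give a 1-D Series; double brackets give a 2-D DataFrame
print(data['square_feet'].shape)    # (3,)
print(data[['square_feet']].shape)  # (3, 1)

# A Series can also be promoted to a one-column DataFrame explicitly
exog = data['square_feet'].to_frame()
print(exog.shape)                   # (3, 1)
```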

The Constant Term Trap

Here’s where beginners hit their first real stumbling block. Unlike scikit-learn, Statsmodels does not automatically add a constant (intercept) term to your regression. You need to add it yourself.

# This will fit a regression WITHOUT an intercept
model = sm.OLS(endog, exog)

# This is what you usually want: a regression WITH an intercept
exog_with_const = sm.add_constant(exog)
model = sm.OLS(endog, exog_with_const)

The sm.add_constant() function adds a column of ones to your exog data. This column represents the intercept term in your regression equation. Forgetting this step is one of the most common mistakes in Statsmodels, and it can dramatically change your results.

Why doesn’t Statsmodels add this automatically? Because in some statistical models, you genuinely don’t want an intercept. Econometricians and statisticians sometimes fit models that pass through the origin, and Statsmodels gives you explicit control over this choice rather than making assumptions about what you want.

A Complete Working Example

Let’s walk through a concrete example that shows how endog and exog work together in practice.

Suppose you’re analyzing factors that influence employee salaries. You have data on years of experience, education level, and whether someone works in a management position. Salary is what you want to explain.

import pandas as pd
import statsmodels.api as sm

# Your data
salaries_df = pd.DataFrame({
    'salary': [55000, 62000, 71000, 58000, 88000, 95000],
    'experience_years': [2, 4, 7, 3, 10, 12],
    'education_level': [16, 16, 18, 14, 18, 20],
    'is_manager': [0, 0, 1, 0, 1, 1]
})

# Setting up endog and exog
endog = salaries_df['salary']
exog = salaries_df[['experience_years', 'education_level', 'is_manager']]

# Adding the constant
exog = sm.add_constant(exog)

# Fitting the model
model = sm.OLS(endog, exog)
results = model.fit()

# Viewing the results
print(results.summary())

In this setup, salary is your endog because that’s the outcome you’re studying. The other three variables form your exog because they’re the factors you believe influence salary. The constant gets added to exog, so your final exog DataFrame has four columns: a constant column plus your three predictor variables.

Common Patterns and Variations

As you work with Statsmodels, you’ll encounter several patterns for handling endog and exog.

Pattern 1: Using column names from a DataFrame

# If all your data is in one DataFrame
import statsmodels.formula.api as smf

# Using formula syntax (we'll cover this in depth in another article)
model = smf.ols('salary ~ experience_years + education_level + is_manager', data=salaries_df)
results = model.fit()

This formula approach handles the constant automatically and lets you specify variables by name. It’s more convenient for quick analyses.

Pattern 2: Selecting exog dynamically

# Getting all columns except the target
feature_columns = [col for col in data.columns if col != 'salary']
exog = data[feature_columns]
endog = data['salary']

This pattern works well when you want to use all available features without typing them individually.

Pattern 3: Creating transformed variables

# Creating a log-transformed endog
import numpy as np
endog = np.log(data['salary'])

# Or adding polynomial terms to exog
data['experience_squared'] = data['experience_years'] ** 2
exog = data[['experience_years', 'experience_squared', 'education_level']]

You can transform your variables before passing them to the model. Statsmodels will use whatever data you provide.

What Happens Inside the Model

When you create a model like sm.OLS(endog, exog), Statsmodels does several things behind the scenes:

First, it validates that your endog and exog have compatible shapes. The number of rows (observations) must match. If you have 100 rows in endog, you need 100 rows in exog.

Second, it stores references to your data. The model object doesn’t copy your data immediately; it keeps references to the arrays you passed in. This saves memory for large datasets.

Third, when you call .fit(), Statsmodels performs the actual statistical calculations using your endog and exog. For OLS (Ordinary Least Squares) regression, this means solving for the coefficients that minimize the sum of squared residuals.

The results object that .fit() returns contains everything: coefficient estimates, standard errors, t-statistics, p-values, and diagnostic measures. All of these come from the relationship between your endog and exog.

Handling Missing Data

Statsmodels has specific behavior around missing values that relates to how you structure endog and exog.

# Data with missing values
data_with_na = pd.DataFrame({
    'salary': [55000, np.nan, 71000, 58000],
    'experience': [2, 4, 7, 3]
})

endog = data_with_na['salary']
exog = sm.add_constant(data_with_na[['experience']])

# missing='drop' tells Statsmodels to discard rows with any missing values
model = sm.OLS(endog, exog, missing='drop')
results = model.fit()

The missing parameter tells Statsmodels how to handle NA values. The default is actually missing='none', which performs no checking at all and will typically propagate NaNs into your results or raise errors. Passing missing='drop' removes any row where either endog or exog has a missing value. This keeps your analysis clean, but you need to be aware that you might lose data.

If you want to see how many observations were dropped, check the model’s attributes after fitting:

print(f"Number of observations used: {results.nobs}")

Multiple Endogenous Variables

Some advanced Statsmodels models support multiple endogenous variables. For example, in Vector Autoregression (VAR) for time series, you might have several variables that all depend on past values of each other.

# Multiple time series as endog
endog = data[['sales', 'marketing_spend', 'website_traffic']]

# VAR models treat all these as endogenous
from statsmodels.tsa.api import VAR
model = VAR(endog)

In this case, endog is two-dimensional because you have multiple variables that interact with each other over time. This is an exception to the usual “one column for endog” rule, and it only applies to specific model types.

Practical Tips for Working with endog and exog

Tip 1: Always check your shapes

Before fitting a model, print the shapes of your arrays:

print(f"endog shape: {endog.shape}")
print(f"exog shape: {exog.shape}")

For standard regression, endog should be (n,) or (n, 1) and exog should be (n, k) where n is the number of observations and k is the number of features (including the constant).

Tip 2: Use meaningful variable names

Even though Statsmodels uses endog and exog internally, you can keep your own variable names clear:

price = data['house_price']
features = data[['square_feet', 'bedrooms']]
features = sm.add_constant(features)

model = sm.OLS(endog=price, exog=features)

Using named arguments (endog=price) makes your code more readable.

Tip 3: Verify the constant was added

After adding the constant, check your DataFrame:

exog = sm.add_constant(original_features)
print(exog.head())

You should see a column named ‘const’ filled with 1.0 values. If this column is missing, your model won’t have an intercept term.

Tip 4: Keep your data together before splitting

Instead of creating separate endog and exog variables early, keep everything in one DataFrame and split at the last moment:

# Better workflow
full_data = pd.read_csv('data.csv')
# ... do all your cleaning and transformation here ...
# Split only when fitting the model
endog = full_data['target']
exog = sm.add_constant(full_data[predictor_columns])

This prevents synchronization issues where your endog and exog accidentally get out of alignment.

Why This Convention Helps

Once you get past the initial confusion, the endog/exog naming convention actually provides clarity. It forces you to think explicitly about which variable is your outcome and which are your predictors. This clear distinction matters when you’re interpreting statistical results.

In machine learning, you might casually try swapping what’s X and what’s y just to see what happens. But in statistical modeling, the direction of the relationship matters profoundly. Predicting salary from experience is a different question than predicting experience from salary, even though both models would run without errors.

The terminology also prepares you for more advanced statistical concepts. When you eventually encounter instrumental variables, two-stage least squares, or simultaneous equation models, the language of endogenous and exogenous variables becomes essential. These models explicitly deal with cases where your “exogenous” variables might actually be influenced by your outcome, violating standard regression assumptions.

Moving Forward

Understanding endog and exog is your foundation for everything else in Statsmodels. Every model you fit—whether it’s linear regression, logistic regression, time series models, or generalized linear models—uses this same basic structure. The specific parameters and options change, but the core pattern remains: you provide an endog variable representing what you want to model, and exog variables representing what you’re using to model it.

The next time you see sm.OLS(endog, exog) or sm.Logit(endog, exog), you’ll know exactly what’s expected. Your endog is one-dimensional: the outcome variable. Your exog is two-dimensional: your predictors, usually with a constant. This simple structure unlocks the entire library.
