Cohort Analysis using Python: A Detailed Guide

I ran into this problem when an e-commerce client asked why their best customers kept disappearing after month two. The obvious metrics told us nothing. What we needed was a way to watch user behavior unfold over time, which is exactly what cohort analysis does.

This article covers cohort analysis from the ground up: what cohorts are, how to build them in Python, and how to read the retention heatmap that comes out the other end.

TLDR

Cohort analysis groups users by their first activity period and tracks retention over time
The core metric is the cohort index — months since a user’s first activity
A retention heatmap makes patterns visible at a glance
Python’s pandas and seaborn handle the full pipeline with minimal code
The approach works on any dataset with timestamps and user IDs

What is Cohort Analysis?

Cohort analysis is a behavioral analytics technique that groups users by a shared time-based characteristic — typically their first purchase month or first login date — and tracks how each group behaves over subsequent periods. Unlike overall metrics that blend all users together, cohort analysis reveals whether retention is improving or degrading and at which stage of the user lifecycle drop-offs happen.

The word “cohort” simply refers to a group of users who share something in common. In time-based cohort analysis, that common trait is the month they first appeared. A cohort for January 2024 contains every user whose first activity was in January 2024. Their row in the retention table shows what percentage came back in February, March, April, and so on.

Companies like Netflix, Spotify, and subscription e-commerce businesses use cohort analysis to answer specific questions: Are new users retaining better this quarter than last? At what point do users churn most frequently? Does a particular acquisition channel produce more loyal customers?

Steps of Cohort Analysis

The cohort analysis pipeline has five stages. First, define the objective — ask a specific question before touching the data. “Are iOS users retained better than Android users?” and “What percentage of users from the March campaign returned after 90 days?” are good starting points.

Second, identify cohort membership. For each user, find their first activity date. This becomes the cohort label. If a user made their first purchase on March 15, they belong to the March cohort regardless of when they made subsequent purchases.

Third, calculate the cohort index. For each activity record, compute how many months have passed since the user’s cohort month.

Fourth, build the retention table. Count unique users in each cohort at each cohort index, then divide by the cohort size to get retention percentages.

Fifth, visualize and interpret. A heatmap of retention percentages reveals patterns. Strong retention in month 0 followed by a sharp drop in month 1 indicates a bad onboarding experience. Gradual decline over many months suggests the product has a long-term value problem rather than a first-impression problem.

Cohort Analysis in Python

Python’s pandas and seaborn handle the full cohort analysis pipeline. The code below uses the Online Retail dataset from Kaggle — the same type of timestamped transaction data you’d find in any e-commerce database.

Start by loading the required libraries and the dataset:


import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt

data = pd.read_excel("Online Retail.xlsx")
print(f"Dataset shape: {data.shape}")
print(data[['CustomerID', 'InvoiceDate']].head())

Loading the dataset shows its shape and confirms the CustomerID and InvoiceDate columns needed for cohort analysis.


Dataset shape: (541909, 8)
  CustomerID   InvoiceDate
0     17850.0  2010-12-01 08:26:00
1     17850.0  2010-12-01 08:26:00
2     17850.0  2010-12-01 08:26:00

Rows without a CustomerID cannot be attributed to any cohort and must be removed:


data = data.dropna(subset=['CustomerID'])
data['CustomerID'] = data['CustomerID'].astype(int)
print(f"Records after removing nulls: {len(data)}")

After filtering to records with a CustomerID, the dataset is ready for cohort month assignment.


Records after removing nulls: 406829

Next, create a cohort month column by extracting the year and month from InvoiceDate. The get_cohort_month function normalizes each date to the first day of its month using Python’s datetime module:


def get_cohort_month(m):
    return dt.datetime(m.year, m.month, 1)

data['InvoiceMonth'] = data['InvoiceDate'].apply(get_cohort_month)
data['CohortMonth'] = data.groupby('CustomerID')['InvoiceMonth'].transform('min')
print(data[['CustomerID', 'InvoiceMonth', 'CohortMonth']].head())

The CohortMonth column holds each user’s first activity month. The transform('min') call ensures every row for a given customer gets their earliest invoice month, regardless of which invoice the current row represents.


   CustomerID InvoiceMonth CohortMonth
0       17850  2010-12-01  2010-12-01
1       17850  2010-12-01  2010-12-01
2       17850  2010-12-01  2010-12-01

Now calculate the cohort index — the number of months between the cohort month and the current invoice month:


data['InvoiceYear'] = data['InvoiceMonth'].dt.year
data['InvoiceMonthNum'] = data['InvoiceMonth'].dt.month
data['CohortYear'] = data['CohortMonth'].dt.year
data['CohortMonthNum'] = data['CohortMonth'].dt.month

data['CohortIndex'] = (
    (data['InvoiceYear'] - data['CohortYear']) * 12
    + (data['InvoiceMonthNum'] - data['CohortMonthNum'])
    + 1
)
print(data[['CustomerID', 'CohortMonth', 'InvoiceMonth', 'CohortIndex']].head(10))

The cohort index starts at 1 for the first activity month. Each subsequent month the user appears increments the index by 1.


   CustomerID CohortMonth InvoiceMonth  CohortIndex
0       17850  2010-12-01  2010-12-01            1
1       17850  2010-12-01  2010-12-01            1
2       17850  2010-12-01  2010-12-01            1

Build a retention table using pandas’ pivot table where rows are cohort months and columns are cohort indices, with values showing the count of unique customers:


cohort_data = (
    data.groupby(['CohortMonth', 'CohortIndex'])['CustomerID']
    .nunique()
    .reset_index()
)

cohort_table = cohort_data.pivot(
    index='CohortMonth',
    columns='CohortIndex',
    values='CustomerID'
)
cohort_size = cohort_table.iloc[:, 0]
retention_table = cohort_table.divide(cohort_size, axis=0)
print(retention_table.round(3))

The retention table shows what fraction of each cohort returned in each subsequent month. Values closer to 1.0 mean higher retention. Each cell represents the percentage of that cohort still active in that month.


CohortIndex          1      2      3      4      5
CohortMonth
2010-12-01        1.000  0.314  0.217  0.177  0.126
2011-01-01        1.000  0.332  0.210  0.145  0.107
2011-02-01        1.000  0.349  0.224  0.157    NaN

Now visualize the retention table as a heatmap:


plt.figure(figsize=(14, 8))
sns.heatmap(
    retention_table,
    annot=True,
    fmt='.0%%',
    cmap='Greens',
    linewidths=0.5
)
plt.title('Monthly Cohort Retention Rate')
plt.ylabel('Cohort Month')
plt.xlabel('Months Since First Purchase')
plt.tight_layout()
plt.show()

The heatmap renders a grid of retention percentages colored by intensity. Darker green cells indicate higher retention, and each row represents a cohort tracked from its first activity month.


# plt.show() displays the heatmap figure inline.
# The visualization itself is the output.

How to Interpret Cohort Retention Data

The most important pattern to look for is where the drop happens. If month 1 retention is significantly lower than month 0, users are not coming back after their first experience. This points to an onboarding or first-impression problem. If retention drops gradually over months 3 through 6, the product likely has a long-term engagement issue rather than a bad first impression.

Cohort analysis also lets you compare retention across time periods. If your December cohort retains better than your September cohort at the same cohort index, something you did between September and December improved the user experience. Conversely, a degrading retention curve across successive cohorts is an early warning sign that needs investigation.

For e-commerce specifically, cohort retention is a direct revenue signal. Users who return in month 2 are far more likely to convert again than users who never return. Tracking cohort metrics like average order value over time gives you a forward-looking revenue model that top-line metrics cannot provide.

Beyond retention rates, cohort analysis can be extended to measure customer lifetime value by multiplying retention curves by average revenue per user. This is the foundation of most subscription business models where the question is not just “are users returning” but “how much are they worth over time.”

FAQ

Q: What is the difference between cohort analysis and segmentation?

Cohort analysis is a type of segmentation where the grouping variable is time-based. Other segmentation approaches might group users by acquisition channel, device type, or purchase frequency. Cohort analysis specifically tracks behavior over time, which is what makes it powerful for retention analysis.

Q: Can cohort analysis be used without a user ID?

Without a user ID, you cannot track the same individual across multiple transactions. You can still do transaction-level cohort analysis by grouping invoices by month, but this measures transaction volume rather than user retention. For true cohort retention, a unique user identifier is required.

Q: What is a good retention rate for e-commerce?

It varies widely by business model. Subscription businesses might target 80%+ month-1 retention, while one-time purchase e-commerce may see 20-30% returning in month 2. The value of cohort analysis is not in comparing your numbers to an industry benchmark but in tracking whether your own retention is improving or declining over time.

Q: What libraries besides seaborn can create cohort heatmaps?

The plotly library offers interactive heatmaps that let you hover over cells to see exact values. The seaborn heatmap is the standard choice for static reports and publications.

Cohort analysis turns a flat transaction log into a story about user behavior over time — and that story directly informs product decisions, marketing spend, and retention strategy.

Cohort Analysis using Python: A Detailed Guide

How to Provision Cloud Infrastructure Using Python with CDK for Terraform

How to Become a Freelance Python Developer: From Zero Experience to Paid Projects

Deploying High-Performance Python Web Applications: Best Practices for Developers

Cohort Analysis using Python: A Detailed Guide

TLDR

What is Cohort Analysis?

Steps of Cohort Analysis

Cohort Analysis in Python

How to Interpret Cohort Retention Data

FAQ

Related Posts

How to Provision Cloud Infrastructure Using Python with CDK for Terraform

How to Become a Freelance Python Developer: From Zero Experience to Paid Projects

Deploying High-Performance Python Web Applications: Best Practices for Developers