Accuracy is a trap for anyone working with imbalanced classification. In fraud detection, cancer screening, churn prediction, and spam filtering, the majority class dominates by orders of magnitude. A model trained on such data can achieve 95% accuracy by simply predicting the majority class every time. That is not a useful model. The accuracy metric does not tell you that you have built something worthless.

Precision and recall solve this problem by measuring what the confusion matrix actually says. This article covers how both metrics work, when to prioritize each one, and how to compute them in Python with code you can run today.

TLDR

  • Precision = TP / (TP + FP). Of every positive prediction, how many were actually correct?
  • Recall = TP / (TP + FN). Of every actual positive, how many did the model catch?
  • The precision-recall tradeoff is fundamental. Lowering the threshold raises recall and lowers precision; raising it does the reverse.
  • F1 is the harmonic mean of both. It penalizes extreme imbalance between precision and recall.
  • sklearn handles all of this with one-line functions: precision_score, recall_score, f1_score.

Everything Starts With the Confusion Matrix

Before precision and recall make sense, you need to understand the confusion matrix. For binary classification, it is a 2×2 grid that describes every possible outcome when your model classifies something as positive or negative.

The four cells are: true positives (TP) where actual and predicted are both positive, true negatives (TN) where both are negative, false positives (FP) where the model predicted positive but the actual was negative, and false negatives (FN) where the model predicted negative but the actual was positive.

A fraud detection framing helps here. If you have 1000 transactions and 10 of them are fraud, the confusion matrix tells you exactly how your model performs on each category. That framing makes the formulas below feel less abstract.

TP, FP, FN, TN = 7, 5, 3, 985

print(f"Actual fraud transactions: {TP + FN}")
print(f"Actual legitimate:         {FP + TN}")
print(f"Model flagged as fraud:    {TP + FP}")
print(f"Model flagged as clean:    {FN + TN}")

What Precision Actually Measures

Precision answers a specific question: of everything my model marked as positive, how many were actually positive? The formula is TP divided by TP plus FP. You can also think of it as the probability that a positive prediction is correct.

precision = TP / (TP + FP)
print(f"Precision = {TP}/({TP}+{FP}) = {precision:.4f}")

From the fraud example, precision is 7 divided by 12, which gives 0.583. That means 58.3% of the transactions the model flagged as fraud were actually fraud. The other 41.7% were false alarms. In a fraud prevention system, every false positive creates friction for a legitimate customer. High precision means your model is not crying wolf.

Precision becomes undefined when your model predicts zero positives. If TP plus FP equals zero, you are dividing by zero. Always check the confusion matrix before computing any metric.

What Recall Actually Measures

Recall answers a different question: of everything that was actually positive, how many did my model catch? The formula is TP divided by TP plus FN. This is also called the true positive rate or sensitivity.

recall = TP / (TP + FN)
print(f"Recall = {TP}/({TP}+{FN}) = {recall:.4f}")

From the fraud example, recall is 7 divided by 10, which gives 0.700. That means 70% of actual fraud was caught. The other 30% slipped through undetected. In fraud prevention, missed fraud means direct financial loss. In cancer screening, missed positives mean late-stage diagnosis. Recall is the metric that matters when missing a positive case is expensive or dangerous.

Recall becomes undefined when there are no actual positives in your dataset: if TP plus FN equals zero, you are again dividing by zero. This happens with synthetic datasets or when sampling has removed every positive example.
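Both edge cases can be handled explicitly. As a sketch, sklearn's precision_score and recall_score accept a zero_division argument (available since scikit-learn 0.22) that controls what is returned when the denominator is zero:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Model that never predicts positive: TP + FP = 0, so precision is undefined.
y_true = np.array([0, 1, 0, 1])
y_pred = np.array([0, 0, 0, 0])

# zero_division tells sklearn what to return instead of raising a warning.
p = precision_score(y_true, y_pred, zero_division=0)
print(f"Precision with no positive predictions: {p}")  # 0.0

# Dataset with no actual positives: TP + FN = 0, so recall is undefined.
y_true_empty = np.array([0, 0, 0, 0])
r = recall_score(y_true_empty, y_pred, zero_division=0)
print(f"Recall with no actual positives: {r}")  # 0.0
```

Returning 0 is a conservative convention, not the only choice; zero_division also accepts 1 or "warn" depending on how you want these edge cases surfaced.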

The Precision-Recall Tradeoff Is Real

You cannot maximize both simultaneously. Lowering the threshold for flagging something as positive catches more real positives (higher recall) but also generates more false alarms (lower precision). Raising the threshold makes your predictions more conservative, which improves precision but means you miss more real positives (lower recall).

The sweet spot depends entirely on your use case. For fraud prevention, I generally start with recall because a missed fraud transaction costs more than a false alarm. For email spam filtering, I favor precision because users get furious when legitimate email disappears. For medical diagnosis, recall is paramount. For a recommendation system, precision matters more because you want to surface relevant items, even if you miss some good ones.

import numpy as np

def precision_recall_at_threshold(y_true, y_proba, threshold):
    preds = (y_proba >= threshold).astype(int)
    TP = np.sum((preds == 1) & (y_true == 1))
    FP = np.sum((preds == 1) & (y_true == 0))
    FN = np.sum((preds == 0) & (y_true == 1))
    precision = TP / (TP + FP) if (TP + FP) > 0 else 0
    recall = TP / (TP + FN) if (TP + FN) > 0 else 0
    return precision, recall

y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1, 0, 0])
y_proba = np.array([0.2, 0.85, 0.9, 0.3, 0.7, 0.4, 0.95, 0.88, 0.1, 0.35])

for thresh in [0.3, 0.5, 0.7, 0.9]:
    p, r = precision_recall_at_threshold(y_true, y_proba, thresh)
    print(f"Threshold {thresh}: Precision={p:.2f}, Recall={r:.2f}")

F1 Score: When You Need a Balance

F1 score is the harmonic mean of precision and recall. It penalizes extreme imbalance between the two. If one is very low, F1 will be low even if the other is perfect. The formula is 2 times precision times recall, divided by precision plus recall.

precision = 0.5833
recall = 0.7000
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 = 2 * {precision:.4f} * {recall:.4f} / ({precision:.4f} + {recall:.4f}) = {f1:.4f}")

The harmonic mean is less forgiving than the arithmetic mean. If precision is 1.0 and recall is 0.5, the arithmetic mean is 0.75 but the F1 is 0.667. When stakeholders tell you they want a balance between precision and recall, F1 is usually what they mean. F1 is always less than or equal to the arithmetic mean of precision and recall.
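A quick check of those numbers, using nothing beyond the formulas above:

```python
precision, recall = 1.0, 0.5

arithmetic = (precision + recall) / 2
harmonic = 2 * precision * recall / (precision + recall)  # this is F1

print(f"Arithmetic mean: {arithmetic:.3f}")  # 0.750
print(f"F1 (harmonic):   {harmonic:.3f}")    # 0.667

# The harmonic mean never exceeds the arithmetic mean.
assert harmonic <= arithmetic
```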

Computing Metrics With scikit-learn

Computing these metrics by hand is educational. Using sklearn is what you do in production. The breast cancer dataset from sklearn gives you a clean binary classification problem to test on, and it ships with the library itself, so there is nothing extra to download.

from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix

data = datasets.load_breast_cancer()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

precision = precision_score(y_test, predictions)
recall = recall_score(y_test, predictions)
f1 = f1_score(y_test, predictions)
cm = confusion_matrix(y_test, predictions)

print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1 Score:  {f1:.4f}")
print(f"Confusion Matrix:\n{cm}")
TN, FP, FN, TP = cm.ravel()
print(f"TN={TN}, FP={FP}, FN={FN}, TP={TP}")

Output:

Precision: 0.9815
Recall:    0.9815
F1 Score:  0.9815
Confusion Matrix:
[[61  2]
 [ 2 106]]
TN=61, FP=2, FN=2, TP=106

The breast cancer dataset is almost balanced, which is why precision and recall are nearly identical. In imbalanced datasets, they diverge sharply. Try running the same code on a fraud dataset where the positive class is the minority and you will see them diverge immediately.
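If you do not have a fraud dataset handy, a synthetic one makes the same point. This sketch uses make_classification with a weights argument to create roughly 2% positives; the exact numbers will vary, but accuracy typically comes out far higher than recall on the minority class:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Synthetic imbalanced problem: roughly 2% positives.
X, y = make_classification(
    n_samples=20000, n_features=20, weights=[0.98],
    flip_y=0.01, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)
preds = model.predict(X_test)

print(f"Accuracy:  {accuracy_score(y_test, preds):.4f}")
print(f"Precision: {precision_score(y_test, preds, zero_division=0):.4f}")
print(f"Recall:    {recall_score(y_test, preds, zero_division=0):.4f}")
```

The stratify=y argument keeps the positive rate the same in train and test, which matters when positives are this rare.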

If you are building a model from scratch instead of using sklearn, here is a guide to logistic regression from scratch that walks through the algorithm step by step.

Precision-Recall Curve in sklearn

The precision-recall curve shows how precision and recall change as you move the decision threshold across all possible values. It is one of the most useful diagnostic tools for binary classification problems, especially imbalanced ones.

import numpy as np
from sklearn.metrics import precision_recall_curve, average_precision_score

y_proba = model.predict_proba(X_test)[:, 1]

precision_vals, recall_vals, thresholds = precision_recall_curve(y_test, y_proba)
avg_precision = average_precision_score(y_test, y_proba)

print(f"Average Precision: {avg_precision:.4f}")
print(f"Recall range:     {recall_vals.min():.4f} to {recall_vals.max():.4f}")
print(f"Threshold range:  {thresholds.min():.4f} to {thresholds.max():.4f}")

Average precision summarizes the curve into a single number. It is the weighted mean of precision at each threshold, weighted by the change in recall. It is a better single-number summary than F1 for imbalanced datasets because it accounts for the entire precision-recall tradeoff.
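That definition can be checked by hand. This sketch recomputes average precision as the sum of precision times the increase in recall at each threshold, on a small made-up score array (not from any model), and compares it to sklearn's value:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, average_precision_score

# Hand-made labels and scores; one negative (0.8) outranks one positive (0.7).
y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1, 0, 0])
y_score = np.array([0.2, 0.85, 0.9, 0.3, 0.7, 0.8, 0.95, 0.88, 0.1, 0.35])

prec, rec, _ = precision_recall_curve(y_true, y_score)

# AP = sum over thresholds of (R_n - R_{n-1}) * P_n.
# precision_recall_curve returns recall in decreasing order, so take the
# drop between consecutive points to get the recall gained at each step.
ap_manual = np.sum((rec[:-1] - rec[1:]) * prec[:-1])
ap_sklearn = average_precision_score(y_true, y_score)

print(f"Manual AP:  {ap_manual:.6f}")
print(f"sklearn AP: {ap_sklearn:.6f}")
```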

Beyond Binary Classification

Precision and recall extend naturally to multiclass problems through a one-vs-rest approach. For a three-class problem, you compute precision for class A by treating all A instances as positive and everything else as negative. The same process applies to each class separately.

from sklearn.metrics import classification_report
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
preds = clf.predict(X_test)

report = classification_report(y_test, preds, target_names=iris.target_names)
print(report)

The output gives precision, recall, and F1 for each class, plus macro and weighted averages (and an overall accuracy line). Micro averaging pools the TP, FP, and FN counts across all classes before computing the metric. Macro averaging computes the metric per class and gives each class equal weight regardless of frequency. Weighted averaging also computes per class but weights each class by its frequency.

For imbalanced datasets, weighted F1 is usually the most informative because it accounts for class frequency. Micro F1 is equivalent to accuracy in multiclass problems.
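The averaging strategies are available directly through the average parameter of precision_score, recall_score, and f1_score. A short sketch on the same iris setup, including a check of the micro-F1-equals-accuracy claim:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
preds = clf.predict(X_test)

# The same predictions, summarized three different ways.
for avg in ["micro", "macro", "weighted"]:
    print(f"{avg:>8} F1: {f1_score(y_test, preds, average=avg):.4f}")

# In single-label multiclass problems, micro F1 equals plain accuracy.
print(f"accuracy:    {accuracy_score(y_test, preds):.4f}")
```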

Cost-Sensitive Threshold Selection

In practice, the right threshold is not the one that maximizes F1. It is the one that minimizes your expected cost. If a false negative costs 10 times more than a false positive, you should tolerate lower precision to get higher recall.

def expected_cost(precision, recall, fn_cost=1.0, fp_cost=0.05, prevalence=0.01, n=10000):
    actual_positives = int(n * prevalence)

    # Back out confusion-matrix counts from the operating point.
    TP = int(recall * actual_positives)
    FN = actual_positives - TP
    FP = int((1 - precision) / precision * TP) if precision > 0 else 0

    # Only the errors carry a cost in this model.
    total_cost = FN * fn_cost + FP * fp_cost
    return total_cost / n

cost1 = expected_cost(precision=0.9, recall=0.6, fn_cost=1.0, fp_cost=0.05, prevalence=0.01)
cost2 = expected_cost(precision=0.7, recall=0.85, fn_cost=1.0, fp_cost=0.05, prevalence=0.01)

print(f"High-precision threshold: ${cost1:.4f} per transaction")
print(f"High-recall threshold:    ${cost2:.4f} per transaction")
print(f"Savings from higher recall: ${cost1 - cost2:.4f} per transaction")

Running this with a fraud cost of 1.0, false positive cost of 0.05, and 1% prevalence typically shows that the higher-recall threshold is cheaper overall. In most fraud scenarios, the cost of a missed fraud case outweighs the cost of a false alarm by a large margin.
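Putting the pieces together, the cost-minimizing threshold can be found with a simple sweep. This is a sketch on synthetic scores; the score distributions and the costs are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic scores: roughly 1% fraud prevalence; fraud tends to score higher.
n = 10000
y_true = (rng.random(n) < 0.01).astype(int)
y_proba = np.clip(rng.normal(0.2, 0.15, n) + 0.5 * y_true, 0, 1)

fn_cost, fp_cost = 1.0, 0.05  # illustrative per-transaction costs

best_thresh, best_cost = None, float("inf")
for thresh in np.arange(0.05, 0.95, 0.05):
    preds = (y_proba >= thresh).astype(int)
    FN = np.sum((preds == 0) & (y_true == 1))
    FP = np.sum((preds == 1) & (y_true == 0))
    cost = (FN * fn_cost + FP * fp_cost) / n
    if cost < best_cost:
        best_thresh, best_cost = thresh, cost

print(f"Cost-minimizing threshold: {best_thresh:.2f}")
print(f"Cost per transaction:      ${best_cost:.4f}")
```

With asymmetric costs like these, the sweep usually settles well below the default 0.5 cutoff, which is exactly the high-recall bias the paragraph above argues for.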

When to Use Which Metric

Precision matters more when the cost of a false positive is high. Spam filtering is the canonical example. If your model marks a legitimate email as spam, the user misses something important; a spam message slipping into the inbox is a minor annoyance by comparison. So you err on the side of caution and keep more emails in the inbox, even if it means lower recall.

Recall matters more when the cost of a false negative is high. Cancer screening, fraud detection, and fault detection in manufacturing all fit this category. A missed cancer case can be fatal. A missed fraud transaction costs money. A missed equipment failure can cause physical damage. In these cases, you tune for recall even if it means more false alarms.

F1 is the metric to use when precision and recall both matter and you need a single number to compare models. Kaggle competitions use it frequently. So do teams that need to report one number to leadership without explaining the tradeoff.

The Accuracy Trap

Going back to the fraud example: if 99% of transactions are legitimate, a model that predicts everything as legitimate achieves 99% accuracy. It is technically more accurate than a model that correctly identifies 70% of fraud but generates 5% false positives on legitimate transactions. The sophisticated model might have roughly 95% accuracy and catch $2 million in fraud per year. The naive model catches nothing.
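Those accuracy figures are easy to check. This sketch plugs in a 1% fraud rate, 70% recall, and a 5% false-positive rate; with those exact rates, accuracy lands just under 95% (the dollar figure in the paragraph is illustrative and not computed here):

```python
n = 100000
fraud_rate = 0.01
fraud = int(n * fraud_rate)
legit = n - fraud

# Naive model: predicts everything as legitimate.
naive_accuracy = legit / n

# Model that catches 70% of fraud with a 5% false-positive rate on legit.
TP = int(0.70 * fraud)
FP = int(0.05 * legit)
model_accuracy = (TP + (legit - FP)) / n

print(f"Naive accuracy: {naive_accuracy:.2%}")  # 99.00%
print(f"Model accuracy: {model_accuracy:.2%}")  # 94.75%
print(f"Fraud caught by naive model: 0")
print(f"Fraud caught by real model:  {TP}")
```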

A confusion matrix reveals exactly how both models behave, and it is always the first thing I look at when evaluating a classifier. Accuracy is the metric that gets presented in board meetings because it sounds impressive. Precision and recall are the metrics that tell you whether your model is actually working.
