The first time I tried to debug a misbehaving neural network, I spent hours staring at accuracy numbers that looked reasonable but hid a fundamentally broken model. What I needed was a way to actually measure how confident the model was about its predictions, not just whether it got the label right. That is exactly what binary cross-entropy loss gives you.

This article covers what binary cross-entropy loss is, how the math works, how to implement it from scratch in Python, and how to use it with PyTorch and Keras. By the end, you will understand why it is the default loss for binary classification problems.

TLDR

  • Binary cross-entropy measures how close predicted probabilities are to the true labels
  • It penalizes confident wrong predictions heavily, which drives better model calibration
  • The formula is -[y * log(p) + (1-y) * log(1-p)]
  • PyTorch and Keras both provide built-in implementations
  • You can implement it from scratch in about 10 lines of NumPy

What is Binary Cross-Entropy Loss?

Binary cross-entropy loss, also known as log loss, is a loss function used in binary classification problems. It compares the predicted probability p that an input belongs to the positive class against the true label y, which is either 0 or 1. Averaging these comparisons across all samples gives you the training loss.

The function is defined as:


L = -(y * log(p) + (1 - y) * log(1 - p))


If y=1, p=0.9  → L = -log(0.9)  = 0.105
If y=1, p=0.1  → L = -log(0.1)  = 2.302
If y=0, p=0.1  → L = -log(0.9)  = 0.105
If y=0, p=0.9  → L = -log(0.1)  = 2.302

When the true label y is 1, the loss simplifies to -log(p). When y is 0, it becomes -log(1-p). This means the loss is low when the model predicts a high probability for the correct class, and high when it predicts a low probability or, worse, a high probability for the wrong class.

The key insight is that binary cross-entropy does not just check whether a prediction is correct. It penalizes the model in proportion to how wrong it was. A model that predicts 0.9 for a positive sample when the true label is 1 contributes a loss of -log(0.9) ≈ 0.105, which is small. But a model that predicts 0.1 for that same sample contributes -log(0.1) ≈ 2.30, which is about 22 times larger. Confident mistakes cost more.

This property makes binary cross-entropy especially useful because it drives models toward producing well-calibrated probabilities. A well-calibrated model's predicted probabilities reflect actual frequencies: among samples where it predicts 0.9, roughly 90% belong to the positive class.
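
To see how sharply the penalty grows as confidence drifts toward the wrong answer, here is a quick NumPy sketch that evaluates -log(p) for a positive sample at several confidence levels:


import numpy as np

# Loss contribution -log(p) for a positive sample (y = 1)
for p in [0.99, 0.9, 0.5, 0.1, 0.01]:
    print(f"p = {p:<4} -> loss = {-np.log(p):.3f}")


p = 0.99 -> loss = 0.010
p = 0.9  -> loss = 0.105
p = 0.5  -> loss = 0.693
p = 0.1  -> loss = 2.303
p = 0.01 -> loss = 4.605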

Implementing Binary Cross-Entropy from Scratch

Before reaching for a framework, it helps to see how straightforward the implementation is. Here is a pure NumPy version that mirrors what any deep learning library computes under the hood on every forward pass.


import numpy as np

def binary_cross_entropy(y_true, y_pred):
    # Clip predictions to avoid log(0)
    epsilon = 1e-7
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    return -np.mean(
        y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)
    )

# Four samples: true labels and the model's predicted probabilities
y_true = np.array([1, 0, 1, 0])
y_pred = np.array([0.8, 0.2, 0.7, 0.3])

loss = binary_cross_entropy(y_true, y_pred)
print(f"Binary cross-entropy loss: {loss:.4f}")


Binary cross-entropy loss: 0.2899

The epsilon clipping is important. If you pass a prediction of exactly 0 or exactly 1 into the log function, you get negative infinity. Clipping keeps predictions strictly inside (0, 1) and keeps the loss numerically stable.
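
You can see the failure mode directly. This small sketch, relying only on NumPy's default behavior, shows the blow-up and the fix:


import numpy as np

epsilon = 1e-7

# An exact 0 fed into log produces -inf (NumPy also emits a RuntimeWarning)
unclipped = np.array([0.0, 1.0])
print(np.log(unclipped))  # [-inf   0.]

# Clipping keeps every prediction strictly inside (0, 1)
clipped = np.clip(unclipped, epsilon, 1 - epsilon)
print(np.log(clipped))  # finite values, no -inf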

Binary Cross-Entropy with PyTorch

PyTorch provides binary cross-entropy loss through two classes: torch.nn.BCELoss, which expects probabilities that have already been passed through a sigmoid, and torch.nn.BCEWithLogitsLoss, which works directly on raw logits, applies the sigmoid internally, and is more numerically stable than applying sigmoid and BCELoss separately.


import torch
import torch.nn as nn

# BCEWithLogitsLoss combines sigmoid + BCE in one numerically stable call
criterion = nn.BCEWithLogitsLoss()

# Raw logits (before sigmoid) from the model
logits = torch.tensor([[1.2], [-0.5], [0.8], [-1.0]])
# True labels
y_true = torch.tensor([[1.0], [0.0], [1.0], [0.0]])

loss = criterion(logits, y_true)
print(f"BCEWithLogitsLoss: {loss.item():.4f}")


BCEWithLogitsLoss: 0.3554

The logits are raw model outputs that can be any real number. The loss function applies sigmoid internally before computing the cross-entropy. This is the recommended approach for PyTorch binary classification models. Each element in the logits tensor represents the raw prediction for one sample.
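
To convince yourself the two formulations agree, here is a short comparison of manual sigmoid plus BCELoss against BCEWithLogitsLoss on the same logits:


import torch
import torch.nn as nn

logits = torch.tensor([[1.2], [-0.5], [0.8], [-1.0]])
y_true = torch.tensor([[1.0], [0.0], [1.0], [0.0]])

# Manual sigmoid followed by BCELoss computes the same value,
# just with less numerical headroom for extreme logits
probs = torch.sigmoid(logits)
print(f"{nn.BCELoss()(probs, y_true).item():.4f}")             # 0.3554
print(f"{nn.BCEWithLogitsLoss()(logits, y_true).item():.4f}")  # 0.3554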

Binary Cross-Entropy with Keras

In Keras, you set loss="binary_crossentropy" when compiling your model and give the output layer a sigmoid activation, which squashes raw scores into the (0, 1) range. The sigmoid activation is critical here because it converts arbitrary real-valued logits into probabilities that the loss can compare against the labels.


from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

model = Sequential([
    Dense(16, input_dim=8, activation='relu'),
    Dense(1, activation='sigmoid')
])

model.compile(
    loss='binary_crossentropy',
    optimizer=Adam(learning_rate=0.001),
    metrics=['accuracy']
)

# Assuming X_train, y_train, X_val, y_val are already defined
# model.fit(X_train, y_train, epochs=10, batch_size=32, validation_data=(X_val, y_val))
print("Model compiled with binary_crossentropy loss")


Model compiled with binary_crossentropy loss

The output layer uses sigmoid activation to produce a probability between 0 and 1. The binary_crossentropy loss then compares that probability against the true label. For multi-class problems, Keras uses categorical_crossentropy instead.

Why Not Just Use Accuracy?

Accuracy counts correct predictions but ignores how confident the model was. A model that predicts 0.51 for a positive sample and 0.49 for a negative sample will get both right by a tiny margin, but the decision boundary is essentially a coin flip. Binary cross-entropy captures this uncertainty. It rewards models that are not just correct but also confident when they should be.
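
A quick sketch makes the gap concrete, reusing the from-scratch implementation from earlier:


import numpy as np

def binary_cross_entropy(y_true, y_pred):
    epsilon = 1e-7
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1, 0])

# Both sets of predictions are 100% accurate at a 0.5 threshold,
# but the loss separates the coin flip from real confidence
barely_right = np.array([0.51, 0.49])
confident = np.array([0.95, 0.05])
print(f"{binary_cross_entropy(y_true, barely_right):.4f}")  # 0.6733
print(f"{binary_cross_entropy(y_true, confident):.4f}")     # 0.0513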

There is a direct relationship between binary cross-entropy and a concept from information theory called entropy. The cross-entropy between two distributions measures the average number of bits needed to encode events from the true distribution using a model distribution. Binary cross-entropy is simply the special case for two outcomes. When the model perfectly matches the true distribution, cross-entropy equals entropy, the minimum possible loss.
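
A small numeric check makes this concrete. Assuming a hypothetical true positive-class probability of 0.7, cross-entropy reaches its minimum, equal to the entropy of the true distribution, exactly when the model matches it:


import numpy as np

p_true = 0.7  # hypothetical true probability of the positive class

def cross_entropy(p, q):
    # Expected encoding cost of outcomes from distribution p under model q
    return -(p * np.log(q) + (1 - p) * np.log(1 - q))

for q in [0.5, 0.6, 0.7, 0.8]:
    print(f"q = {q}: cross-entropy = {cross_entropy(p_true, q):.4f}")
# Minimum at q = 0.7 (0.6109), equal to the entropy of the true distribution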

FAQ

Q: What is the difference between BCELoss and BCEWithLogitsLoss in PyTorch?

BCEWithLogitsLoss combines sigmoid activation with binary cross-entropy in a single numerically stable operation. BCELoss expects inputs that are already probabilities (outputs of sigmoid). Using BCEWithLogitsLoss avoids potential numerical instability from computing sigmoid and log separately.

Q: When should I use binary cross-entropy instead of categorical cross-entropy?

Binary cross-entropy is for binary classification problems with two possible classes. Categorical cross-entropy is for multi-class problems. If there are more than two classes, use categorical cross-entropy with a softmax output layer.

Q: Why is my binary cross-entropy loss giving NaN?

The most common cause is passing a prediction of exactly 0 or exactly 1 into the log function, which produces negative infinity. Use epsilon clipping on predictions before computing the loss, or use BCEWithLogitsLoss which handles this internally.

Q: Can binary cross-entropy be used for multi-label classification?

Yes. Multi-label classification treats each label as an independent binary problem, so you apply sigmoid activation (not softmax) and binary cross-entropy loss independently per label. This is different from multi-class, which uses softmax and categorical cross-entropy.
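
As a sketch, a hypothetical Keras model for a three-label problem would swap the single sigmoid output for one sigmoid per label while keeping the same loss:


from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Hypothetical three-label setup: each output is an independent yes/no
model = Sequential([
    Dense(16, input_dim=8, activation='relu'),
    Dense(3, activation='sigmoid')  # one sigmoid per label, not softmax
])
model.compile(loss='binary_crossentropy', optimizer='adam')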

Q: How does binary cross-entropy relate to logistic regression?

Logistic regression uses binary cross-entropy loss as its objective function. The model learns weights that minimize the cross-entropy between the predicted probability distribution and the true label distribution. Minimizing this loss is equivalent to maximizing the likelihood of the observed data under the logistic model.
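
Here is a minimal sketch of that equivalence on a toy synthetic dataset: gradient descent on the mean binary cross-entropy, where the gradient with respect to the logits reduces to p - y.


import numpy as np

# Toy logistic regression trained by gradient descent on mean BCE
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)  # synthetic labels

w, b, lr = np.zeros(2), 0.0, 0.1
for _ in range(500):
    p = 1 / (1 + np.exp(-(X @ w + b)))  # predicted probabilities
    w -= lr * (X.T @ (p - y)) / len(y)  # gradient of mean BCE w.r.t. w
    b -= lr * np.mean(p - y)            # gradient of mean BCE w.r.t. b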
