I still remember the moment when I was building a regression model and noticed something strange. The average of my squared predictions was always higher than the square of my average prediction. At first I thought I had a bug. Then I realized I was watching Jensen’s inequality in action.
In this article I walk through Jensen’s inequality step by step, explain why the geometric intuition makes sense, verify it with Python code, and show where it appears across machine learning, statistics, and information theory.
TLDR
- For convex functions: E[g(X)] >= g(E[X]). The expectation of the transformed variable is at least the transformation of the expectation.
- For concave functions the inequality flips: E[g(X)] <= g(E[X]).
- Geometrically, convex functions curve upward so the chord connecting two points sits above the function itself.
- Variance is always non-negative because x squared is convex, giving E[X^2] >= (E[X])^2.
- The inequality surfaces in cross-entropy, KL divergence, and convex optimization.
What is Jensen’s Inequality?
Jensen’s inequality is deceptively simple. It tells me that when I apply a convex function to an average, I get something less than or equal to what I would get by applying the function to each point first and then averaging. The mathematical form looks like this: E[g(X)] >= g(E[X]) for convex functions, with the inequality flipping when the function curves the other way.
You might encounter this in a statistics class, an economics problem, or even while debugging a machine learning model that refuses to converge. By the end of this article, I will have shown you why the inequality holds geometrically, how to verify it numerically in Python, and where it appears across data science and engineering domains.
The Geometric Intuition
A convex function curves upward. Imagine drawing any chord between two points on the curve. That chord always sits above the curve itself. This property is what drives Jensen’s inequality.
When I take the average of inputs and then transform, I am essentially evaluating the function at a single point (the mean). When I transform each input individually and then average, I am effectively averaging across the curve. Because the curve bends upward, the chord connecting the transformed endpoints sits above the function at the mean, which gives us E[g(X)] >= g(E[X]).
The mathematical representation uses weights lambda that sum to one:
g(sum lambda_i * x_i) <= sum lambda_i * g(x_i) # for convex g
When the lambdas form a probability distribution (for example, each equal to 1/n), this is exactly the familiar form involving expectations. I verify the equal-weight case first, and the general weighted case right after it.
Verifying Jensen’s Inequality
Pure Python Implementation
I find it helpful to verify the inequality with a simple example using x squared as my convex function. The values 1 through 5, each weighted equally, give me a straightforward test case.
def g(x):
    # The convex transformation: squaring
    return x ** 2

def jensens_inequality(values, weights):
    # Returns (E[g(X)], g(E[X])) for the discrete distribution given by values and weights
    E_x = sum(v * w for v, w in zip(values, weights))
    E_g_x = sum(g(v) * w for v, w in zip(values, weights))
    g_E_x = g(E_x)
    return E_g_x, g_E_x

values = [1, 2, 3, 4, 5]
weights = [1 / len(values)] * len(values)  # equal weights that sum to one

E_g_x, g_E_x = jensens_inequality(values, weights)
print(f"E[g(x)] = {E_g_x}")
print(f"g(E[x]) = {g_E_x}")
print(f"E[g(x)] >= g(E[x]): {E_g_x >= g_E_x}")
E[g(x)] = 11.0
g(E[x]) = 9.0
E[g(x)] >= g(E[x]): True
The result confirms what the inequality predicts. The average of squared values (11.0) is greater than or equal to the square of the average (9.0). This holds because squaring is convex.
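The same check works with unequal weights, as long as they sum to one, which is exactly the general weighted form from the previous section. Here is a minimal standalone sketch; the specific weights are arbitrary illustrations.
# Weighted form: g(sum(w_i * x_i)) <= sum(w_i * g(x_i)) for the convex g(x) = x**2
values = [1, 2, 3, 4, 5]
weights = [0.1, 0.1, 0.2, 0.3, 0.3]  # unequal weights that still sum to 1.0
g_of_weighted_mean = sum(w * v for v, w in zip(values, weights)) ** 2  # g(E[X]), about 12.96
weighted_mean_of_g = sum(w * v ** 2 for v, w in zip(values, weights))  # E[g(X)], 14.6
print(f"g(E[x]) = {g_of_weighted_mean}")
print(f"E[g(x)] = {weighted_mean_of_g}")
print(f"E[g(x)] >= g(E[x]): {weighted_mean_of_g >= g_of_weighted_mean}")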
NumPy Implementation
NumPy makes these calculations vectorized and efficient. I can use the exponential function as my convex function and visualize the relationship at the same time. The exponential function is particularly interesting because of its close ties to probability theory, where the exponential distribution models waiting times.
import numpy as np
import matplotlib.pyplot as plt
x = np.array([1, 2, 3, 4, 5])
weights = np.array([1/len(x)] * len(x))
E_exp_x = np.sum(np.exp(x) * weights)  # E[exp(X)]: transform first, then average
exp_E_x = np.exp(np.sum(x * weights))  # exp(E[X]): average first, then transform
print(f"E[exp(x)] = {E_exp_x}")
print(f"exp(E[x]) = {exp_E_x}")
print(f"E[exp(x)] >= exp(E[x]): {E_exp_x >= exp_E_x}")
plt.figure(figsize=(8, 5))
x_smooth = np.linspace(0.5, 5.5, 100)
plt.plot(x_smooth, np.exp(x_smooth), label='exp(x)', color='blue')
plt.scatter(x, np.exp(x), color='blue', zorder=5)
plt.axvline(x=np.sum(x * weights), color='red', linestyle='--', label='E[x]')
plt.axhline(y=exp_E_x, color='green', linestyle=':', alpha=0.7)
plt.axhline(y=E_exp_x, color='orange', linestyle=':', alpha=0.7)
plt.scatter([np.sum(x * weights)], [exp_E_x], color='green', s=100, zorder=6, label='g(E[x])')
plt.scatter([np.sum(x * weights)], [E_exp_x], color='orange', s=100, zorder=6, label='E[g(x)]')
plt.title("Jensen's Inequality Visualization")
plt.legend()
plt.tight_layout()
plt.savefig('/tmp/jensens_inequality_plot.png', dpi=150)
print("Plot saved to /tmp/jensens_inequality_plot.png")
E[exp(x)] = 46.64083679725964
exp(E[x]) = 20.085536923187668
E[exp(x)] >= exp(E[x]): True
Plot saved to /tmp/jensens_inequality_plot.png
The exponential function grows faster as inputs increase, which means averaging after transformation produces a larger value than transforming after averaging. This is the essence of convexity at work.
Variance and Jensen’s Inequality
One of the cleanest consequences of Jensen’s inequality is the non-negativity of variance. Let me show how this works.
values = np.array([1, 2, 3])
probs = np.array([1/3, 1/3, 1/3])
E_x = np.sum(values * probs)
Var_x = np.sum(probs * (values - E_x) ** 2)
print(f"E[X] = {E_x}")
print(f"Var[X] = {Var_x}")
print(f"Variance is always non-negative: {Var_x >= 0}")
E[X] = 2.0
Var[X] = 0.6666666666666666
Variance is always non-negative: True
Since x squared is convex, Jensen’s inequality tells me that E[X^2] >= (E[X])^2. The difference between these two quantities is exactly the variance. Because the inequality guarantees E[X^2] is always at least (E[X])^2, variance can never dip below zero.
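To make that connection explicit, I can compute both sides directly and check that their gap is exactly the variance. This is a small sketch reusing the same three-point distribution.
import numpy as np

values = np.array([1, 2, 3])
probs = np.array([1/3, 1/3, 1/3])
E_x = np.sum(values * probs)
E_x2 = np.sum(values ** 2 * probs)  # E[X^2]
print(f"E[X^2] = {E_x2}")                              # about 4.667
print(f"(E[X])^2 = {E_x ** 2}")                        # 4.0
print(f"Gap, which equals Var[X]: {E_x2 - E_x ** 2}")  # about 0.667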
Concave Functions: The Flipped Inequality
Concave functions curve downward, which means the chord between two points lies below the curve. This flips the inequality direction. The square root function provides a clear example.
values = np.array([1, 4, 9])
probs = np.array([1/3, 1/3, 1/3])
E_sqrt_x = np.sum(np.sqrt(values) * probs)
sqrt_E_x = np.sqrt(np.sum(values * probs))
print(f"E[sqrt(X)] = {E_sqrt_x}")
print(f"sqrt(E[X]) = {sqrt_E_x}")
print(f"E[sqrt(X)] <= sqrt(E[X]): {E_sqrt_x <= sqrt_E_x}")
E[sqrt(X)] = 2.0
sqrt(E[X]) = 2.1602468994692865
E[sqrt(X)] <= sqrt(E[X]): True
The inequality reverses as expected. Averaging square roots (2.0) gives a smaller result than taking the square root of the average (2.16). This matters when working with functions like log, which is concave, or when analyzing risk measures.
Applications in Data Science
Jensen’s inequality is not just a theoretical curiosity. It surfaces across many practical domains in data science and machine learning.
Information Theory
The concept of entropy relies on the concave log function. Cross-entropy between two distributions is bounded below by the entropy of the true distribution, which follows from Jensen’s bounds. The Kullback-Leibler divergence, which measures the difference between two probability distributions, remains non-negative because of this inequality. That non-negativity is what makes KL divergence usable as a divergence measure in information theory, even though it is not a true distance (it is not symmetric).
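As a quick numerical illustration of that non-negativity (the two distributions here are just made up), the KL divergence between discrete distributions stays at or above zero, which is Jensen’s inequality applied to the concave log:
import numpy as np

# Two arbitrary discrete distributions over the same three outcomes
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.4, 0.4, 0.2])
kl_pq = np.sum(p * np.log(p / q))  # KL(p || q)
kl_qp = np.sum(q * np.log(q / p))  # KL(q || p)
print(f"KL(p || q) = {kl_pq:.4f}")  # positive
print(f"KL(q || p) = {kl_qp:.4f}")  # positive, and different: KL is not symmetric
print(f"Both non-negative: {kl_pq >= 0 and kl_qp >= 0}")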
Machine Learning Loss Functions
Many loss functions in machine learning are convex, and Jensen’s inequality bounds how they behave under averaging: the loss of an averaged prediction is at most the average of the individual losses, as the sketch below shows. When training a predictive model, the relationship between empirical risk and expected risk is governed by this kind of bound, and the maximum likelihood estimator benefits from the same convexity properties during optimization.
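To make that averaging claim concrete, here is a toy sketch with made-up predictions from four hypothetical models and a single target. With squared error, which is convex in the prediction, the loss of the averaged prediction cannot exceed the average of the individual losses.
import numpy as np

y_true = 3.0                            # toy target value
preds = np.array([2.0, 2.5, 4.5, 3.8])  # predictions from four hypothetical models

def squared_error(p):
    # Squared error is convex in the prediction p
    return (p - y_true) ** 2

loss_of_mean = squared_error(preds.mean())  # g(E[prediction])
mean_of_loss = squared_error(preds).mean()  # E[g(prediction)]
print(f"Loss of averaged prediction  = {loss_of_mean:.4f}")  # small
print(f"Average of individual losses = {mean_of_loss:.4f}")  # larger
print(f"Loss of mean <= mean of losses: {loss_of_mean <= mean_of_loss}")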
Signal Processing and Finance
Smoothing signals using moving averages or expected values often reduces peak values when processed through convex functions. A radio signal passed through a square-law detector will have an average output at least as high as the square of the average input. This effect also appears in portfolio optimization, where expected returns under convex risk measures follow Jensen’s bounds.
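A minimal sketch of the square-law detector point, using a purely synthetic noisy input signal:
import numpy as np

rng = np.random.default_rng(0)
signal = 1.0 + 0.5 * rng.standard_normal(10_000)  # synthetic noisy input centered at 1.0
avg_output = np.mean(signal ** 2)         # average detector output, E[x^2]
squared_avg_input = np.mean(signal) ** 2  # square of the average input, (E[x])^2
print(f"Average output E[x^2]    = {avg_output:.4f}")         # about 1.25
print(f"Squared average (E[x])^2 = {squared_avg_input:.4f}")  # about 1.00
print(f"E[x^2] >= (E[x])^2: {avg_output >= squared_avg_input}")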
FAQ
What is the formal definition of Jensen’s inequality?
For a convex function g and weights lambda_i that sum to one: g(sum lambda_i * x_i) <= sum lambda_i * g(x_i). When the lambdas represent probabilities, this becomes E[g(X)] >= g(E[X]).
How does Jensen’s inequality relate to variance?
Since x squared is convex, Jensen’s inequality gives E[X^2] >= (E[X])^2. The difference E[X^2] - (E[X])^2 equals Var[X], proving variance is always non-negative.
What happens when the function is concave instead of convex?
The inequality direction reverses. For concave g: E[g(X)] <= g(E[X]). The square root and logarithm functions are common concave examples.
Is Jensen’s inequality only about averages?
Averages (arithmetic means) are the most common case with equal weights. However, the inequality applies to any weighted average where weights sum to one, including integrals over continuous distributions. Any time you combine multiple values with weights that sum to one, Jensen’s inequality governs how convex transformations behave.
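For the continuous case, a quick Monte Carlo sketch (sampling from a standard normal purely as an illustration) shows the same gap between E[exp(X)] and exp(E[X]):
import numpy as np

rng = np.random.default_rng(42)
samples = rng.normal(loc=0.0, scale=1.0, size=100_000)  # X ~ N(0, 1)
E_exp_x = np.mean(np.exp(samples))  # estimates E[exp(X)], which is exp(0.5), about 1.65
exp_E_x = np.exp(np.mean(samples))  # estimates exp(E[X]), which is exp(0) = 1
print(f"E[exp(X)] is approximately {E_exp_x:.4f}")
print(f"exp(E[X]) is approximately {exp_E_x:.4f}")
print(f"E[exp(X)] >= exp(E[X]): {E_exp_x >= exp_E_x}")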
Why is Jensen’s inequality important in machine learning?
Many loss functions are convex, and the relationship between training error and generalization error involves Jensen’s inequality. It helps bound expected loss and appears whenever predictions or losses are averaged, as in the ensembling sketch earlier.
How does Jensen’s inequality connect to the AM-GM inequality?
The arithmetic mean – geometric mean inequality is a special case of Jensen’s inequality. Since log is concave, Jensen gives log of the arithmetic mean >= average of the logs, and exponentiating both sides yields AM >= GM, as the check below illustrates. This connection appears frequently when analyzing geometric averages and growth rates.
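A quick check of that connection, with a handful of arbitrary positive numbers:
import numpy as np

x = np.array([1.0, 2.0, 4.0, 8.0])  # any positive numbers work
arithmetic_mean = np.mean(x)
geometric_mean = np.exp(np.mean(np.log(x)))  # exp of the average log
print(f"Arithmetic mean = {arithmetic_mean}")  # 3.75
print(f"Geometric mean  = {geometric_mean}")   # about 2.83
print(f"AM >= GM: {arithmetic_mean >= geometric_mean}")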
I first encountered Jensen’s inequality while trying to understand why my model kept underperforming on validation data despite showing improving training scores. The geometric intuition stuck with me. Whether you are analyzing text data, building a predictive model, or just working with data structures, this inequality quietly governs how information flows through transformations.
