Every ML model, regardless of how it was trained or what framework built it, eventually does the same thing: it takes input and produces output. In Python, that step is usually model.predict(). It looks simple. It is simple, until it isn’t.
The same method name appears across scikit-learn, Keras, TensorFlow, PyTorch, XGBoost, LightGBM, and most other ML frameworks. But “predict” means slightly different things in each. The return shapes differ. The input expectations differ. The performance characteristics differ. And the ways it can fail differ too. This article covers what predict() actually does, how it behaves across the major frameworks, and the practical issues you’ll hit when running it in production.
TLDR
- predict() runs inference mode without updating weights
- sklearn returns class labels; Keras with activation returns probabilities; PyTorch requires model.eval() and torch.no_grad()
- Almost all frameworks require 2D input shape (n_samples, n_features) even for single samples
- predict_proba() gives class probabilities in sklearn, XGBoost, and LightGBM
- PyTorch has no built-in predict() method – call the model directly
What model.predict() Actually Does
predict() runs the model in inference mode. It passes your input data through the forward pass and returns the model’s predictions. Unlike fit() or train(), it does not update any weights. It is a pure computation.
At a high level:
1. The input is preprocessed and formatted to match what the model expects
2. The model runs its forward pass
3. Raw outputs (logits, probabilities, regression values) are returned
The critical thing to understand: whether predict() returns probabilities or raw logits depends on the framework, and on whether the final activation function is part of the model. Most of the confusion starts here.
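A quick NumPy sketch of the distinction, assuming a hypothetical model whose final layer is linear (so its raw outputs are logits):

```python
import numpy as np

# Hypothetical raw outputs (logits) from a model with no final activation
logits = np.array([-2.0, 0.0, 3.0])

# Applying sigmoid manually maps each logit into (0, 1)
probabilities = 1.0 / (1.0 + np.exp(-logits))

print(logits)                        # [-2.  0.  3.]
print(np.round(probabilities, 3))    # [0.119 0.5   0.953]
```

The raw values and the probabilities are monotonically related, so argmax-style decisions agree either way; thresholds like 0.5 only make sense after the activation.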
scikit-learn: The Baseline
scikit-learn has the most consistent and predictable predict() behavior. It is the reference implementation that most other frameworks loosely follow.
Classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(predictions.shape) # (200,)
print(predictions[:10]) # array([0, 1, 1, 0, 1, 0, 1, 1, 0, 0])
predict() returns a NumPy array of class labels. For binary classification, these are 0 or 1. For multiclass, these are integer indices.
Getting Probabilities in sklearn
If you want probabilities, you need predict_proba():
probabilities = model.predict_proba(X_test)
print(probabilities.shape) # (200, 2) — two classes
print(probabilities[:3])
# [[0.85, 0.15],
# [0.12, 0.88],
# [0.73, 0.27]]
Note that predict_proba() returns the probability for each class. The order matches model.classes_. If you need just the positive class probability in binary classification, use predict_proba(X_test)[:, 1].
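For most sklearn classifiers, predict() is effectively the argmax of predict_proba() mapped through model.classes_. A NumPy sketch with hypothetical values showing that relationship:

```python
import numpy as np

# Hypothetical predict_proba() output and the model's classes_ attribute
proba = np.array([[0.85, 0.15],
                  [0.12, 0.88],
                  [0.73, 0.27]])
classes_ = np.array([0, 1])  # column order of proba follows this

# Recover the labels predict() would return
labels = classes_[np.argmax(proba, axis=1)]
print(labels)  # [0 1 0]

# Positive-class probability for binary classification
positive = proba[:, classes_ == 1].ravel()
print(positive)  # [0.15 0.88 0.27]
```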
Regression
from sklearn.ensemble import GradientBoostingRegressor
# Assumes X_train/y_train here come from a regression dataset (continuous y),
# not the binary labels created above
model = GradientBoostingRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(predictions.shape) # (200,)
print(predictions[:5]) # array([2.34, -0.87, 1.56, 3.21, 0.12])
For regressors, predict() returns floating-point values directly. There is no separate method for raw scores vs. final output; the returned value is always the final prediction.
sklearn Summary
| Model Type | Return Shape | Return Type |
|---|---|---|
| Classifier | (n_samples,) | Integer labels |
| Regressor | (n_samples,) | Float values |
| predict_proba() | (n_samples, n_classes) | Float probabilities |
Keras / TensorFlow: Classification Requires Sigmoid or Softmax
Keras is where most developers hit their first predict() surprise. predict() returns whatever the final layer outputs: probabilities if the output layer has a sigmoid or softmax activation, raw logits if it does not.
Binary Classification
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
# Binary classification model
model = Sequential([
Dense(64, activation='relu', input_shape=(20,)),
Dense(1, activation='sigmoid') # sigmoid on output layer
])
model.compile(optimizer='adam', loss='binary_crossentropy')
model.fit(X_train, y_train, epochs=10, verbose=0)
# sigmoid is on the output layer, so predict() returns probabilities
predictions = model.predict(X_test, verbose=0)
print(predictions.shape) # (n_samples, 1)
print(predictions[:5].flatten())
# [0.87, 0.12, 0.65, 0.91, 0.34]
Here’s the gotcha: if you build a binary classification model without an activation function on the final layer (i.e., you plan to apply sigmoid manually), then predict() returns raw logits. If you use activation='sigmoid' on the final layer, predict() returns probabilities.
# Without sigmoid on final layer — returns logits
model_logits = Sequential([
Dense(64, activation='relu', input_shape=(20,)),
Dense(1) # linear output — raw logits
])
model_logits.compile(optimizer='adam', loss='binary_crossentropy')
model_logits.fit(X_train, y_train, epochs=10, verbose=0)
raw_output = model_logits.predict(X_test, verbose=0)
# raw_output is logits, not probabilities
# Apply sigmoid manually to convert: 1 / (1 + np.exp(-raw_output))
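A safer way to do that conversion: the naive formula overflows inside np.exp for large-magnitude logits, while scipy.special.expit is a numerically stable sigmoid. A sketch with hypothetical logit values:

```python
import numpy as np
from scipy.special import expit  # numerically stable sigmoid

raw_output = np.array([[-800.0], [0.0], [2.0]])  # hypothetical logits

# 1 / (1 + np.exp(800)) would overflow and emit a warning;
# expit handles the full float range cleanly
probabilities = expit(raw_output)
print(np.round(probabilities.ravel(), 3))  # [0.    0.5   0.881]
```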
Multiclass Classification
from tensorflow.keras.utils import to_categorical
# One-hot encode labels for multiclass (assumes y_train/y_test contain
# 3 classes here, unlike the binary dataset created earlier)
y_train_cat = to_categorical(y_train, num_classes=3)
y_test_cat = to_categorical(y_test, num_classes=3)
model = Sequential([
Dense(64, activation='relu', input_shape=(20,)),
Dense(3, activation='softmax') # softmax for multiclass
])
model.compile(optimizer='adam', loss='categorical_crossentropy')
model.fit(X_train, y_train_cat, epochs=10, verbose=0)
predictions = model.predict(X_test, verbose=0)
print(predictions.shape) # (n_samples, 3)
print(predictions[:3])
# [[0.05, 0.12, 0.83],
# [0.71, 0.22, 0.07],
# [0.33, 0.45, 0.22]]
With softmax on the final layer, predict() returns probabilities that sum to 1.0 across each row.
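To turn those softmax probabilities into class labels, take the argmax across each row. A NumPy sketch using the hypothetical values above:

```python
import numpy as np

# Hypothetical softmax output from predict()
predictions = np.array([[0.05, 0.12, 0.83],
                        [0.71, 0.22, 0.07],
                        [0.33, 0.45, 0.22]])

# Index of the highest-probability class per row
class_labels = np.argmax(predictions, axis=1)
print(class_labels)  # [2 0 1]

# Confidence of the chosen class
confidence = predictions[np.arange(len(predictions)), class_labels]
print(confidence)  # [0.83 0.71 0.45]
```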
Using predict() with Models Without Output Activation
If you are doing custom training loops or using logits directly, you need to know how to handle raw outputs:
# Raw logits from a model without softmax
logits = model_logits.predict(X_test, verbose=0)
# Convert to probabilities (subtract the row max first for numerical stability)
shifted = logits - logits.max(axis=1, keepdims=True)
probabilities = np.exp(shifted) / np.sum(np.exp(shifted), axis=1, keepdims=True)
# Or simply:
from scipy.special import softmax
probabilities = softmax(logits, axis=1)
predict() vs predict_on_batch()
Keras predict() is designed to handle large datasets by processing in batches internally. For small datasets, this overhead can actually slow things down. Use predict_on_batch() when you know your input size and want to avoid the batch-scheduling overhead:
# Standard predict — handles batching internally
predictions = model.predict(X_test, batch_size=32, verbose=1)
# Manual batch processing for small data
for i in range(0, len(X_test), 32):
batch = X_test[i:i+32]
batch_preds = model.predict_on_batch(batch)
predict() with verbose=1 shows a progress bar, which is useful for large datasets. predict_on_batch() has no progress output; it is a direct computation call.
PyTorch: No Built-In predict() Method
PyTorch does not have a model.predict() method. Developers coming from sklearn or Keras trip up here.
Instead, you put the model in evaluation mode and call the model directly:
import torch
import torch.nn as nn
class SimpleClassifier(nn.Module):
def __init__(self, input_dim):
super().__init__()
self.fc1 = nn.Linear(input_dim, 64)
self.fc2 = nn.Linear(64, 1)
def forward(self, x):
x = torch.relu(self.fc1(x))
x = torch.sigmoid(self.fc2(x))
return x
model = SimpleClassifier(input_dim=20)
model.eval() # Critical: set to evaluation mode
# Inference
with torch.no_grad():
X_test_tensor = torch.tensor(X_test, dtype=torch.float32)
predictions = model(X_test_tensor)
print(predictions.shape) # (200, 1)
print(predictions[:5].numpy().flatten())
The eval() Mode Matters
Dropout layers, batch normalization, and other stochastic layers behave differently in training vs. evaluation. Always call model.eval() before inference:
model.eval() # Disables dropout, uses running stats for BatchNorm
with torch.no_grad(): # Disables gradient computation
predictions = model(X_test_tensor)
Common PyTorch Inference Patterns
# Batch inference
def predict_batch(model, X, batch_size=64):
model.eval()
predictions = []
with torch.no_grad():
for i in range(0, len(X), batch_size):
batch = torch.tensor(X[i:i+batch_size], dtype=torch.float32)
preds = model(batch)
predictions.append(preds.numpy())
return np.concatenate(predictions)
# CPU vs GPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
with torch.no_grad():
X_test_tensor = torch.tensor(X_test, dtype=torch.float32).to(device)
predictions = model(X_test_tensor).cpu().numpy()
PyTorch with torch.compile (PyTorch 2.0+)
PyTorch 2.0 introduced torch.compile(), which JIT-compiles the model for faster inference:
model = SimpleClassifier(input_dim=20)
model.eval()
# Compilation can speed up inference; the gain varies by model and hardware
compiled_model = torch.compile(model)
with torch.no_grad():
predictions = compiled_model(X_test_tensor)
XGBoost and LightGBM: Native Gradient Boosting
XGBoost and LightGBM have their own predict() methods that behave similarly to sklearn but with important differences.
XGBoost
import xgboost as xgb
model = xgb.XGBClassifier(n_estimators=100, eval_metric='logloss')
# Note: the use_label_encoder argument was deprecated and then removed
# in recent XGBoost versions; omit it on current releases
model.fit(X_train, y_train)
# Default: returns class predictions
predictions = model.predict(X_test)
print(predictions.shape) # (200,)
# Probabilities
probabilities = model.predict_proba(X_test)
print(probabilities.shape) # (200, 2)
XGBoost with raw_score and pred_leaf
XGBoost exposes additional prediction types:
# Raw margin scores (before the global link function)
raw_scores = model.predict(X_test, output_margin=True)
# Leaf indices (useful for tree interpretation)
leaf_indices = model.predict(X_test, pred_leaf=True)
print(leaf_indices.shape) # (200, n_trees) — which leaf each tree puts the sample in
LightGBM
import lightgbm as lgb
model = lgb.LGBMClassifier(n_estimators=100)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
probabilities = model.predict_proba(X_test)
# Raw scores
raw_scores = model.predict(X_test, raw_score=True)
# Leaf indices
leaf_preds = model.predict(X_test, pred_leaf=True)
Common Pitfalls Across All Frameworks
1. Input Shape Mismatches
The single most common error. model.predict() almost always expects 2D input (n_samples, n_features), even if you are predicting a single sample.
# Wrong — 1D array
single_sample = X_test[0]
predictions = model.predict(single_sample) # Shape mismatch error
# Correct — 2D array
single_sample = X_test[0:1] # Shape (1, n_features)
predictions = model.predict(single_sample)
Here is the tricky part: NumPy broadcasting papers over 1D vs. 2D differences in many other contexts, but predict() is strict about input shape.
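One defensive pattern: np.atleast_2d promotes a 1D sample to a batch of one and leaves 2D input untouched, so it is a cheap guard before calling predict():

```python
import numpy as np

single = np.arange(20.0)                 # 1D sample, shape (20,)
batch_of_one = np.atleast_2d(single)     # prepends a batch axis
print(batch_of_one.shape)                # (1, 20)

already_2d = np.zeros((5, 20))
print(np.atleast_2d(already_2d).shape)   # (5, 20) — 2D input passes through
```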
2. Not Setting the Model to Evaluation Mode (PyTorch)
Dropout being active during inference will randomly zero out neurons, producing different outputs every call. Always:
model.eval() # Before inference
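A minimal sketch of the failure mode, assuming PyTorch is available and using a throwaway model: in train mode, dropout randomizes the forward pass; after model.eval(), repeated calls on the same input agree exactly.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(),
                      nn.Dropout(p=0.5), nn.Linear(8, 1))
x = torch.randn(3, 4)

model.train()   # dropout active: outputs vary call to call
a = model(x)
b = model(x)

model.eval()    # dropout disabled: outputs are deterministic
with torch.no_grad():
    c = model(x)
    d = model(x)

print(torch.equal(c, d))  # True — eval-mode outputs are identical
```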
3. Forgetting That predict() Returns Indices, Not Probabilities (sklearn)
# Wrong — predict() returns labels, not probabilities, and using the
# resulting boolean array in an `if` raises ValueError (ambiguous truth value)
if model.predict(X_test) > 0.5:
...
# Correct — for binary classification
proba = model.predict_proba(X_test)[:, 1]
predictions = (proba > 0.5).astype(int)
4. Keras predict() Batching Overhead for Small Inputs
For small test sets, Keras predict() can be slower than expected due to internal batch scheduling:
# Slow for small data — batch scheduling overhead
predictions = model.predict(X_small, verbose=0)
# Faster for small data
predictions = model.predict_on_batch(X_small)
5. Ignoring the Dtype of Your Input
# If your training data was float32 but inference is float64
X_test_wrong = np.array(X_test, dtype=np.float64)
predictions = model.predict(X_test_wrong) # May work or may cast unexpectedly
# Ensure matching dtype
X_test_correct = np.array(X_test, dtype=np.float32)
predictions = model.predict(X_test_correct)
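A small helper (the name as_float32 is made up for illustration) that normalizes the dtype once before inference, assuming the model was trained in float32:

```python
import numpy as np

def as_float32(X):
    """Cast input to float32 only when needed, avoiding a copy otherwise."""
    X = np.asarray(X)
    return X.astype(np.float32, copy=False)

X_float64 = np.random.rand(5, 20)        # NumPy defaults to float64
X_ready = as_float32(X_float64)
print(X_ready.dtype)                     # float32

already = np.zeros((2, 2), dtype=np.float32)
print(as_float32(already) is already)    # True — no copy when dtype matches
```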
6. XGBoost/LightGBM Using Wrong Input Type After sklearn
sklearn models accept pandas DataFrames. XGBoost and LightGBM often work better with their native data structures for large datasets:
import xgboost as xgb
# DMatrix is XGBoost's native data structure — faster for large data
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test)
model = xgb.train(params, dtrain, num_boost_round=100)
predictions = model.predict(dtest) # Note: different API — model is Booster, not Classifier
Batch Prediction Performance
When you need to predict on large datasets, how you batch matters:
def batch_predict(model, X, framework='sklearn', batch_size=1000):
n_samples = len(X)
predictions = []
for start in range(0, n_samples, batch_size):
end = min(start + batch_size, n_samples)
batch = X[start:end]
if framework == 'sklearn':
preds = model.predict(batch)
elif framework == 'keras':
preds = model.predict(batch, verbose=0)
elif framework == 'pytorch':
with torch.no_grad():
batch_tensor = torch.tensor(batch, dtype=torch.float32)
preds = model(batch_tensor).numpy()
predictions.append(preds)
return np.concatenate(predictions)
Key points:
- sklearn: internal batching is usually sufficient; pass the whole array
- Keras: the batch_size parameter in predict() controls internal batching; set it based on your memory constraints
- PyTorch: manual batching gives you full control
What About predict_proba() and Other Variants?
Frameworks typically provide variant methods:
| Method | Returns | Available In |
|---|---|---|
| predict() | Class labels (sklearn) or probabilities (Keras with activation) | All |
| predict_proba() | Class membership probabilities | sklearn, XGBoost, LightGBM |
| predict_log_proba() | Log probabilities | sklearn |
| predict_on_batch() | Same as predict(), explicit batch | Keras |
| predict_async() | Async version | Some frameworks (e.g., TensorFlow.js) |
Use predict_proba() when you need the uncertainty of a prediction, not just the label. These methods are essential for:
- Threshold tuning (choosing your own classification threshold)
- Calibrated probabilities
- Ensemble methods that weight predictions by confidence
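Threshold tuning, the first use case above, is a short loop once you have probabilities. A NumPy sketch that picks the threshold maximizing F1 on hypothetical validation scores:

```python
import numpy as np

# Hypothetical positive-class probabilities and true labels from validation
proba = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.2])
y_true = np.array([0,   0,   1,    1,   1,    0])

def f1_at(threshold):
    """F1 score when classifying positive at proba >= threshold."""
    pred = (proba >= threshold).astype(int)
    tp = np.sum((pred == 1) & (y_true == 1))
    fp = np.sum((pred == 1) & (y_true == 0))
    fn = np.sum((pred == 0) & (y_true == 1))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Sweep candidate thresholds and keep the best
thresholds = np.linspace(0.05, 0.95, 19)
best = max(thresholds, key=f1_at)
print(round(best, 2))  # 0.25
```

On this toy data the F1-optimal threshold sits well below the default 0.5, which is exactly the situation threshold tuning is for (e.g., imbalanced classes).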
Putting It Together: A Framework-Agnostic predict() Wrapper
If you are working with multiple frameworks in the same codebase, a thin wrapper can smooth over the differences:
import numpy as np
import torch  # needed for the pytorch branch
def predict(model, X, framework='sklearn', proba=False):
X = np.asarray(X)
if X.ndim == 1:
X = X.reshape(1, -1) # Ensure 2D
if framework == 'sklearn':
if proba:
return model.predict_proba(X)
return model.predict(X)
elif framework == 'keras':
preds = model.predict(X, verbose=0)
if proba:
return preds
return (preds > 0.5).astype(int).flatten()
elif framework == 'pytorch':
model.eval()
with torch.no_grad():
X_tensor = torch.tensor(X, dtype=torch.float32)
preds = model(X_tensor).numpy()
if proba:
return preds
return (preds > 0.5).astype(int).flatten()
elif framework in ('xgboost', 'lightgbm'):
if proba:
return model.predict_proba(X)
return model.predict(X)
else:
raise ValueError(f"Unknown framework: {framework}")
The Core Principle
model.predict() is a framework-specific inference call that:
- Takes your preprocessed input data
- Runs the forward pass without updating weights
- Returns predictions in framework-specific format (labels, probabilities, or raw scores)
FAQ
Q: Does model.predict() update weights?
No. predict() runs inference mode, which is a pure forward pass with no weight updates. Only fit(), train(), or backward() operations change model parameters.
Q: Why does Keras predict() return different values than sklearn?
sklearn always returns class labels (0 or 1) for classifiers. Keras returns raw values that depend on your output layer activation. If you used sigmoid, you get probabilities. If you used no activation, you get logits and must apply sigmoid manually.
Q: Why does my PyTorch model give different outputs every time I call it?
You probably forgot model.eval() or are still inside a torch.enable_grad() context. Dropout and certain other layers behave differently in training mode. Always call model.eval() and use torch.no_grad() for inference.
Q: Can I use predict() on a single sample?
Yes, but you must pass a 2D array with shape (1, n_features), not a 1D array. Single samples need X[0:1] not X[0] for most frameworks.
Q: What is the difference between predict() and predict_proba()?
predict() returns class labels (for classifiers) or values (for regressors). predict_proba() returns class membership probabilities. Use predict_proba() when you need confidence scores or want to tune your own classification threshold.
The surface similarity across frameworks masks important differences in return types, input shape requirements, and behavior in training vs. evaluation mode. Understanding these differences is what separates code that works in a notebook from code that works in production.