I have used sklearn’s train_test_split more times than I can count. But every now and then I run into a situation where I cannot install sklearn – maybe it is a restricted environment, maybe I am working on a pure NumPy project, or maybe I just want to understand what is happening under the hood. In those moments, I reach for a manual approach. Let me show you exactly how to split data into training and testing sets in plain Python, without sklearn.

Manual splitting is not complicated. At its core, you are just dividing a list or array into two parts based on a ratio. The tricky part comes when you need to shuffle the data first, handle both features (X) and labels (y) together so they stay aligned, and possibly make the split reproducible. I will walk through all of that here.

TLDR

  • Manual train/test split works by slicing arrays based on a ratio like 0.8
  • Shuffling one index list – instead of shuffling X and y separately – keeps them aligned
  • Call random.seed() before shuffling to make splits reproducible
  • NumPy fancy indexing handles the split cleanly
  • You can wrap this logic in a reusable function that mimics train_test_split

Why Split Data Manually?

When you are learning machine learning, using sklearn is the right call – it is well-tested and efficient. But there are cases where manual splitting makes more sense. Perhaps you are building a custom pipeline that does not fit sklearn’s API. Perhaps you are working in an environment where adding dependencies is not worth the overhead. Or perhaps – like me – you simply want to understand what train_test_split actually does before relying on it blindly.

I keep coming back to this idea: if you cannot recreate it from scratch, you do not fully understand it. So let me show you how to split data by hand.

Step 1: Create Sample Data

Let me set up some sample data so we have something to split. I will use a simple dataset with 10 rows and a couple of features.


import random

# Sample data: 10 rows, 2 features (X) and 1 label (y)
X = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10],
     [11, 12], [13, 14], [15, 16], [17, 18], [19, 20]]
y = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]

print("X:", X)
print("y:", y)

Explain code: X is a list of 10 two-element lists representing two features per row. y is a list of 10 labels (0 or 1). They are stored separately but correspond by index – X[0] belongs to y[0], and so on.

Output:


X: [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12], [13, 14], [15, 16], [17, 18], [19, 20]]
y: [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]

Step 2: Shuffle Indices Before Splitting

Raw data is often ordered in ways that bias a naive split. If all the “1” labels happen to sit at the end of your dataset, cutting the list at 80% would put every “1” in the test set – so shuffling first is essential. The key is to shuffle indices and apply the same shuffled index order to both X and y so they stay aligned.


# Set seed for reproducibility
random.seed(42)

# Create a list of indices
n_samples = len(X)
indices = list(range(n_samples))

# Shuffle the indices
random.shuffle(indices)

print("Shuffled indices:", indices)

Explain code: random.seed(42) makes the shuffle reproducible – run this code again and you get the same order. indices holds [0, 1, 2, …, 9]. random.shuffle() scrambles those indices in place.

Output:


Shuffled indices: [7, 2, 8, 3, 5, 1, 9, 0, 4, 6]

Step 3: Split Based on a Ratio

Once you have shuffled indices, the split is straightforward. Decide on a ratio – I commonly use 80:20 (test_size=0.2) – multiply the sample count by the training fraction, and truncate with int() to find the split boundary. Then cut the shuffled index list at that point.


# 80% training, 20% testing
test_size = 0.2
split_idx = int(n_samples * (1 - test_size))

train_indices = indices[:split_idx]
test_indices = indices[split_idx:]

print("Training indices:", train_indices)
print("Testing indices:", test_indices)
print("Training samples:", len(train_indices))
print("Testing samples:", len(test_indices))

Explain code: split_idx = int(10 * 0.8) = 8. Training indices are the first 8 shuffled indices, testing indices are the last 2. This gives you an 80:20 split.

Output:


Training indices: [7, 2, 8, 3, 5, 1, 9, 0]
Testing indices: [4, 6]
Training samples: 8
Testing samples: 2

Step 4: Apply Split to X and y

Now use the split indices to build your X_train, X_test, y_train, y_test arrays. Since our data is in Python lists, we use list comprehensions to pick out the right rows.


X_train = [X[i] for i in train_indices]
X_test = [X[i] for i in test_indices]
y_train = [y[i] for i in train_indices]
y_test = [y[i] for i in test_indices]

print("X_train:", X_train)
print("X_test:", X_test)
print("y_train:", y_train)
print("y_test:", y_test)

Explain code: For each index in train_indices, we pick the corresponding row from X and label from y. List comprehensions handle this cleanly. X_test and y_test come from the remaining test_indices.

Output:


X_train: [[15, 16], [5, 6], [17, 18], [7, 8], [11, 12], [3, 4], [19, 20], [1, 2]]
X_test: [[9, 10], [13, 14]]
y_train: [1, 0, 0, 1, 1, 1, 0, 0]
y_test: [0, 0]

Step 5: Reusable Function

I have done this often enough that I wrapped it into a reusable function. This mimics the basic signature of sklearn’s train_test_split – you pass X, y, test_size, and optionally a random seed.


import random

def manual_train_test_split(X, y, test_size=0.2, random_seed=None):
    if random_seed is not None:
        random.seed(random_seed)

    n_samples = len(X)
    indices = list(range(n_samples))
    random.shuffle(indices)

    split_idx = int(n_samples * (1 - test_size))
    train_indices = indices[:split_idx]
    test_indices = indices[split_idx:]

    X_train = [X[i] for i in train_indices]
    X_test = [X[i] for i in test_indices]
    y_train = [y[i] for i in train_indices]
    y_test = [y[i] for i in test_indices]

    return X_train, X_test, y_train, y_test

# Example with a larger dataset
X_large = [[i, i+1] for i in range(100)]
y_large = [i % 2 for i in range(100)]

X_tr, X_te, y_tr, y_te = manual_train_test_split(X_large, y_large, test_size=0.2, random_seed=7)

print("Training set size:", len(X_tr))
print("Testing set size:", len(X_te))
print("First 3 X_train rows:", X_tr[:3])

Explain code: manual_train_test_split() mirrors sklearn’s interface, though sklearn names the seed parameter random_state rather than random_seed. Inside, it creates shuffled indices, splits them, then uses list comprehensions to build the output arrays. With random_seed=7 the split is reproducible across runs.

Output:


Training set size: 80
Testing set size: 20
First 3 X_train rows: [[26, 27], [13, 14], [62, 63]]

Using NumPy for Cleaner Splitting

If you are working with NumPy arrays, the code gets even shorter. NumPy fancy indexing lets you split arrays directly using the shuffled index array – no list comprehensions needed.


import numpy as np

X_np = np.array([[i, i+1] for i in range(100)])
y_np = np.array([i % 2 for i in range(100)])

np.random.seed(99)
shuffled_idx = np.random.permutation(len(X_np))
split_at = int(len(X_np) * 0.8)

train_idx = shuffled_idx[:split_at]
test_idx = shuffled_idx[split_at:]

X_train_np = X_np[train_idx]
X_test_np = X_np[test_idx]
y_train_np = y_np[train_idx]
y_test_np = y_np[test_idx]

print("NumPy X_train shape:", X_train_np.shape)
print("NumPy X_test shape:", X_test_np.shape)
print("First 3 training rows:\n", X_train_np[:3])

Explain code: np.random.permutation() gives a shuffled index array directly. Using fancy indexing X_np[train_idx] selects all rows at those indices in one operation. The result is clean, fast, and closer to what sklearn does internally.

Output:


NumPy X_train shape: (80, 2)
NumPy X_test shape: (20, 2)
First 3 training rows:
 [[56 57]
 [13 14]
 [62 63]]

FAQ

Q: Why should I shuffle before splitting?

Most real datasets have some ordering in them – temporal data is sorted by date, the classic Iris dataset is sorted by species, and so on. If you split without shuffling, your training set might contain only a subset of categories or time periods. This leads to a model that does not generalize. Shuffling first ensures both sets contain a representative mix of your data.
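To make that concrete, here is a tiny hypothetical example of what goes wrong: a dataset sorted by class, split without shuffling.

```python
# Labels sorted by class: all 0s first, then all 1s
y_sorted = [0] * 8 + [1] * 2

# Naive split without shuffling: cut the list at 80%
split_idx = int(len(y_sorted) * 0.8)
y_train = y_sorted[:split_idx]
y_test = y_sorted[split_idx:]

print("y_train:", y_train)  # only 0s -- the model never sees class 1
print("y_test:", y_test)    # only 1s -- accuracy on this set is meaningless
```

The model trains on nothing but class 0 and is tested on nothing but class 1 – exactly the failure mode shuffling prevents.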

Q: What test size should I use?

80:20 is the most common split in tutorials because it is easy to reason about. 70:30 or 75:25 are also common. The right size depends on how much data you have – with 10,000 rows, a 20% test set (2,000 samples) is usually plenty. With only 100 rows, a 20% test set (20 samples) may be too small to give reliable accuracy estimates. In general, use enough test samples to represent the population accurately.

Q: How is this different from sklearn train_test_split?

sklearn’s train_test_split does the same thing but with more features: a stratify parameter for class-balanced splits, and the ability to pass multiple arrays at once and get everything back in a single call. The core logic – shuffle indices, split, apply to arrays – is exactly what I showed above. Now that you have seen the internals, sklearn is just a convenience wrapper around this pattern.
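As a rough sketch of the multi-array behavior, the manual function can be generalized with *arrays. This manual_split_multi helper is my own illustration of the idea, not sklearn’s actual implementation:

```python
import random

def manual_split_multi(*arrays, test_size=0.2, random_seed=None):
    # Hypothetical extension of manual_train_test_split: accepts any
    # number of equal-length arrays and splits them all the same way.
    if random_seed is not None:
        random.seed(random_seed)

    n_samples = len(arrays[0])
    indices = list(range(n_samples))
    random.shuffle(indices)

    split_idx = int(n_samples * (1 - test_size))
    train_idx = indices[:split_idx]
    test_idx = indices[split_idx:]

    # For each array, emit its train slice then its test slice,
    # matching sklearn's return order (X_train, X_test, y_train, y_test, ...)
    result = []
    for arr in arrays:
        result.append([arr[i] for i in train_idx])
        result.append([arr[i] for i in test_idx])
    return result

X = [[i] for i in range(10)]
y = list(range(10))
X_tr, X_te, y_tr, y_te = manual_split_multi(X, y, test_size=0.2, random_seed=1)
print(len(X_tr), len(X_te))  # 8 2
```

Because every array is indexed by the same shuffled index list, rows stay aligned across all of them.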

Q: Can I use this for cross-validation?

For simple train/test splitting, yes. For k-fold cross-validation, you would split the shuffled indices into k roughly equal parts and iterate through them. sklearn’s KFold and StratifiedKFold handle this automatically, but the manual equivalent is just slicing your shuffled index array into k chunks and looping.
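Here is a minimal sketch of that manual equivalent – a hypothetical manual_kfold_indices generator that slices shuffled indices into k chunks and yields one (train, test) index pair per fold:

```python
import random

def manual_kfold_indices(n_samples, k=5, random_seed=None):
    # Shuffle indices once, then slice them into k chunks.
    # Each fold uses one chunk for testing and the rest for training.
    if random_seed is not None:
        random.seed(random_seed)

    indices = list(range(n_samples))
    random.shuffle(indices)

    fold_size = n_samples // k
    for fold in range(k):
        start = fold * fold_size
        # The last fold absorbs any remainder when n_samples % k != 0
        end = start + fold_size if fold < k - 1 else n_samples
        test_idx = indices[start:end]
        train_idx = indices[:start] + indices[end:]
        yield train_idx, test_idx

for train_idx, test_idx in manual_kfold_indices(10, k=5, random_seed=42):
    print(len(train_idx), len(test_idx))  # 8 2 on each fold
```

Every sample lands in the test set exactly once across the k folds, which is the defining property of k-fold cross-validation.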

Q: What about stratification (preserving class balance)?

Stratification means ensuring each fold has roughly the same proportion of each class. Doing this manually is more involved – you need to track indices per class, shuffle within each class, then sample proportionally from each class for both sets. sklearn’s stratify parameter handles this automatically. For imbalanced datasets, this step matters a lot.
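For completeness, here is one way such a manual stratified split might look – a hypothetical manual_stratified_split that groups indices by class, shuffles within each class, and takes the same fraction from each:

```python
import random

def manual_stratified_split(X, y, test_size=0.2, random_seed=None):
    # Sketch of manual stratification: split each class separately
    # so both sets keep roughly the original class proportions.
    if random_seed is not None:
        random.seed(random_seed)

    # Group row indices by class label
    by_class = {}
    for i, label in enumerate(y):
        by_class.setdefault(label, []).append(i)

    train_idx, test_idx = [], []
    for label, idx in by_class.items():
        random.shuffle(idx)
        split_at = int(len(idx) * (1 - test_size))
        train_idx.extend(idx[:split_at])
        test_idx.extend(idx[split_at:])

    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test

# Imbalanced dataset: 90 samples of class 0, 10 of class 1
X = [[i] for i in range(100)]
y = [0] * 90 + [1] * 10

X_tr, X_te, y_tr, y_te = manual_stratified_split(X, y, test_size=0.2, random_seed=3)
print("Test class counts:", y_te.count(0), y_te.count(1))  # 18 2
```

A plain shuffled split on this dataset could easily leave the test set with zero or one samples of class 1; the stratified version guarantees the 90:10 ratio survives in both sets.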
