A Practitioner's Guide to Cross-Validation in Finance

Standard k-fold cross-validation can be dangerously misleading when applied to financial time-series data. Here's why, and what to do about it.

The Problem with Standard CV

Standard k-fold cross-validation assumes that observations are independent and identically distributed (i.i.d.). Financial data breaks that assumption in three ways:

Serial correlation -- Returns, and especially their volatility, are correlated with recent values, so neighboring observations are not independent

Non-stationarity -- The data distribution changes over time (regime shifts)

Information leakage -- Features and labels computed over rolling windows span several observations, so a training sample that sits close to a test sample can share information with it
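
To make the leakage point concrete, here is a minimal sketch that measures how many test samples have a training sample within one rolling window of them. The 20-sample `lookback`, the sample count, and the helper name `share_of_test_near_train` are all assumptions chosen for illustration, not values from any particular dataset.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 1000
lookback = 20  # hypothetical rolling-window length, in samples


def share_of_test_near_train(train_idx, test_idx, lookback):
    """Fraction of test samples with a training sample within `lookback` steps."""
    train_sorted = np.sort(train_idx)
    # Position at which each test index would be inserted into the sorted train indices
    pos = np.searchsorted(train_sorted, test_idx)
    last = len(train_sorted) - 1
    left = train_sorted[np.clip(pos - 1, 0, last)]   # nearest training index at or below
    right = train_sorted[np.clip(pos, 0, last)]      # nearest training index at or above
    nearest = np.minimum(np.abs(test_idx - left), np.abs(test_idx - right))
    return float(np.mean(nearest <= lookback))


X = np.zeros((n_samples, 1))  # placeholder features; only the indices matter here

shuffled = KFold(n_splits=5, shuffle=True, random_state=0)
shuffled_shares = [
    share_of_test_near_train(tr, te, lookback) for tr, te in shuffled.split(X)
]

contiguous = KFold(n_splits=5, shuffle=False)
contiguous_shares = [
    share_of_test_near_train(tr, te, lookback) for tr, te in contiguous.split(X)
]

print("shuffled:  ", np.round(shuffled_shares, 2))
print("contiguous:", np.round(contiguous_shares, 2))
```

With shuffled folds, essentially every test sample has a training neighbor inside the window. With contiguous folds, the contamination is confined to the fold boundaries, which is the region that purging is designed to remove.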

Better Alternatives

Purged Walk-Forward CV

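from sklearn.model_selection import TimeSeriesSplit

def purged_walk_forward(X, y, n_splits=5, embargo_pct=0.01):
    """
    Walk-forward CV with a purge gap between each training and test fold.

    Purging drops the training samples closest to the test fold, since their
    rolling-window features or labels may overlap with test observations.
    An embargo drops training samples that immediately follow a test fold;
    walk-forward training data always precedes the test fold, so here
    embargo_pct only sets the size of the purge gap.
    """
    tscv = TimeSeriesSplit(n_splits=n_splits)
    embargo_size = int(len(X) * embargo_pct)

    for train_idx, test_idx in tscv.split(X):
        # Purge the last embargo_size training samples before the test fold.
        # Guard the zero case: train_idx[:-0] would return an empty array.
        if embargo_size > 0:
            train_idx = train_idx[:-embargo_size]
        yield train_idx, test_idx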
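
For a fixed-size gap, recent versions of scikit-learn expose the same idea directly through the `gap` argument of `TimeSeriesSplit`. The sketch below assumes a hypothetical 10-sample `lookback` and synthetic data, and simply checks how far apart the end of training and the start of testing are in each fold.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n_samples = 500
lookback = 10  # hypothetical feature/label window, in samples

X = np.arange(n_samples).reshape(-1, 1)  # synthetic, time-ordered placeholder data

# `gap` excludes that many samples from the end of each training set
tscv = TimeSeriesSplit(n_splits=5, gap=lookback)
splits = list(tscv.split(X))

for fold, (train_idx, test_idx) in enumerate(splits):
    # Distance between the last training sample and the first test sample
    distance = test_idx[0] - train_idx[-1]
    print(f"fold {fold}: train ends at {train_idx[-1]}, "
          f"test starts at {test_idx[0]}, distance {distance}")
```

A fixed gap is only an approximation when label horizons vary from sample to sample; in that case the purge would need to be driven by each label's actual end time rather than a sample count.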

Combinatorial Purged CV (CPCV)

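Introduced by Marcos Lopez de Prado, CPCV splits the data into N contiguous groups and tests on every combination of k of them, purging and embargoing the training data around each test block. The resulting splits can be recombined into multiple backtest paths, which yields a distribution of performance estimates from the same data instead of a single walk-forward path.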
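
The split generation can be sketched as follows. This is a simplified illustration, not a reference implementation: the function name `cpcv_splits` and its parameters are hypothetical, and purge and embargo are expressed as fixed sample counts, whereas a full implementation would purge based on each label's time span.

```python
from itertools import combinations
from math import comb

import numpy as np


def cpcv_splits(n_samples, n_groups=6, n_test_groups=2, purge=0, embargo=0):
    """Yield (train_idx, test_idx) for every choice of test groups.

    purge:   samples dropped from training immediately before each test block.
    embargo: samples dropped from training immediately after each test block.
    """
    indices = np.arange(n_samples)
    groups = np.array_split(indices, n_groups)  # contiguous, time-ordered groups

    for test_groups in combinations(range(n_groups), n_test_groups):
        test_idx = np.concatenate([groups[g] for g in test_groups])
        keep = np.ones(n_samples, dtype=bool)
        for g in test_groups:
            start, stop = groups[g][0], groups[g][-1] + 1
            # Exclude the test block, the purge zone before it,
            # and the embargo zone after it from training.
            keep[max(0, start - purge):min(n_samples, stop + embargo)] = False
        yield indices[keep], test_idx


n_groups, n_test_groups = 6, 2
splits = list(cpcv_splits(600, n_groups, n_test_groups, purge=5, embargo=5))

# Each group is tested in C(N-1, k-1) splits, giving k/N * C(N, k) backtest paths
n_paths = comb(n_groups, n_test_groups) * n_test_groups // n_groups
print(len(splits), n_paths)  # 15 5
```

With 6 groups and 2 test groups per split, there are 15 train/test splits, and every observation is tested 5 times, once per path. Assembling the per-split predictions into those paths is left out of this sketch.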

Key Takeaways

Never use random k-fold CV on time-series data
Always purge samples near train/test boundaries
Apply an embargo period at least as long as your feature lookback
Use multiple paths through the data to reduce variance of your performance estimate

Financial ML requires financial ML techniques. Don't trust generic tools blindly.