A Practitioner's Guide to Cross-Validation in Finance
Standard k-fold cross-validation can be dangerously misleading when applied to financial time-series data. Here's why, and what to do about it.
The Problem with Standard CV
Financial data has three properties that violate the assumptions of standard cross-validation:
Better Alternatives
Purged Walk-Forward CV
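from sklearn.model_selection import TimeSeriesSplit

def purged_walk_forward(X, y, n_splits=5, embargo_pct=0.01):
    """
    Walk-forward CV with a purged gap before each test window.
    Purging removes training samples near the train/test boundary.
    An embargo (dropping samples right after a test window) is not
    needed here because training data always precedes the test data.
    """
    tscv = TimeSeriesSplit(n_splits=n_splits)
    embargo_size = int(len(X) * embargo_pct)
    for train_idx, test_idx in tscv.split(X):
        # Purge: drop the last `embargo_size` training samples.
        # An explicit end index avoids `[:-0]`, which would empty the
        # training set whenever embargo_size rounds down to zero.
        train_idx = train_idx[:max(len(train_idx) - embargo_size, 0)]
        yield train_idx, test_idx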
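Walk-forward splits only ever train on the past, so a single gap before the test window is enough. When a test block sits in the middle of the sample, as in the combinatorial scheme below, training data lies on both sides and purging and embargo do different jobs. The following is a minimal sketch of that case, not a reference implementation. It assumes the label at index i is built from observations i through i + horizon; the function name and the `horizon` and `embargo` parameters are hypothetical, chosen for illustration.

```python
def purge_and_embargo(n_samples, test_start, test_stop, horizon=0, embargo=0):
    """Return training indices for a test block [test_start, test_stop).

    Assumption: the label at index i depends on observations i..i+horizon.
    """
    train = []
    for i in range(n_samples):
        if test_start <= i < test_stop:
            continue  # belongs to the test block
        # Purge before the block: this training label reaches into the test block.
        if i < test_start and i + horizon >= test_start:
            continue
        # After the block: the first `horizon` samples overlap test labels
        # that extend past the block's end; the next `embargo` samples are
        # skipped as an extra safety margin.
        if test_stop <= i < test_stop + horizon + embargo:
            continue
        train.append(i)
    return train

train = purge_and_embargo(n_samples=20, test_start=8, test_stop=12, horizon=2, embargo=3)
print(train)  # [0, 1, 2, 3, 4, 5, 17, 18, 19]
```

With a label horizon of 2, indices 6 and 7 are purged before the block, and indices 12 through 16 are dropped after it (2 for label overlap, 3 for the embargo).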
Combinatorial Purged CV (CPCV)
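Introduced by Marcos López de Prado in Advances in Financial Machine Learning (2018), CPCV splits the data into N groups and holds out every combination of k groups as a test set, purging and embargoing the training samples around each one. The resulting test folds can be stitched into several complete backtest paths, so you get a distribution of performance estimates from the same data rather than a single path.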
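To make the combinatorics concrete, here is a rough sketch of the splitting step in plain Python. It partitions the sample into N contiguous groups, holds out every combination of k groups, and drops training samples within a fixed margin of each test group. The function names and the `purge` and `embargo` sample counts are assumptions for illustration, not an API from any library, and assembling the test folds into actual backtest paths is left out.

```python
from itertools import combinations
from math import comb

def cpcv_splits(n_samples, n_groups=6, n_test_groups=2, purge=0, embargo=0):
    """Yield (train_idx, test_idx) for every choice of test groups.

    purge / embargo: number of training samples dropped just before /
    just after each test group (hypothetical parameters).
    """
    # Contiguous, near-equal groups: group g covers [bounds[g], bounds[g + 1]).
    bounds = [g * n_samples // n_groups for g in range(n_groups + 1)]
    for test_groups in combinations(range(n_groups), n_test_groups):
        test, blocked = set(), set()
        for g in test_groups:
            start, stop = bounds[g], bounds[g + 1]
            test.update(range(start, stop))
            # Exclude the test group and its purge/embargo margins from training.
            blocked.update(range(max(start - purge, 0), min(stop + embargo, n_samples)))
        train = [i for i in range(n_samples) if i not in blocked]
        yield train, sorted(test)

def n_backtest_paths(n_groups, n_test_groups):
    # Each group is tested in C(N - 1, k - 1) splits, which equals k / N * C(N, k).
    return comb(n_groups - 1, n_test_groups - 1)

splits = list(cpcv_splits(n_samples=60, n_groups=6, n_test_groups=2, purge=2, embargo=1))
print(len(splits))             # 15 splits, i.e. C(6, 2)
print(n_backtest_paths(6, 2))  # 5 backtest paths
```

With N = 6 and k = 2, each group appears in the test set of five different splits, which is why the 15 splits can be recombined into five full-length paths.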
Key Takeaways
- Never use random k-fold CV on time-series data
- Always purge samples near train/test boundaries
- Apply an embargo period proportional to your feature lookback
- Use multiple paths through the data to reduce variance of your performance estimate