Most backtests are a story you tell yourself. I wanted something that would tell me when I was wrong.

The problem with standard backtesting is that it's too easy to find edge in historical data. Enough parameters, enough trials, and you'll find a configuration that looks good. This is curve fitting, not discovery. The strategy won't trade forward the way it traded backward.

Walk-forward validation is the fix. You train on a window, test on the next window, roll forward. If the edge is real, it should appear consistently across windows. If it only shows up in the training windows, it's noise that fit your data.
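The rolling-window mechanics are easy to sketch. A minimal generator, assuming calendar days and the defaults described below (the actual splitter isn't shown in this post):

```python
from datetime import date, timedelta

def walk_forward_windows(start, end, train_days=90, test_days=30, step_days=7):
    """Yield (train_start, test_start, test_end) boundaries, rolling the
    whole train+test span forward by step_days until it runs past `end`."""
    cursor = start
    while cursor + timedelta(days=train_days + test_days) <= end:
        test_start = cursor + timedelta(days=train_days)
        yield cursor, test_start, test_start + timedelta(days=test_days)
        cursor += timedelta(days=step_days)

# One year of data yields a few dozen overlapping train/test splits.
windows = list(walk_forward_windows(date(2023, 1, 1), date(2024, 1, 1)))
```

Note the windows overlap heavily: a weekly step with a 120-day span means each day of data appears in many windows, which is what makes the per-window statistics a distribution rather than a single number.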

The framework

I built the walk-forward backtester as a separate system with injected dependencies — market data, signal function, position sizer. The windows are configurable; the defaults are a 90-day training window and a 30-day test window, rolled forward one week at a time. Each run produces a distribution of Sharpe ratios, win rates, and max drawdowns across windows.
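The dependency-injected shape can be sketched like this — the names and signatures here are illustrative, not the real interfaces. The backtester takes a return series, a signal function, and a sizer, and emits one Sharpe ratio per test window:

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Callable

@dataclass
class WalkForwardBacktester:
    # Injected dependencies -- names and signatures are illustrative:
    returns: list           # daily returns of the traded instrument
    signal: Callable        # train-window returns -> directional signal
    size: Callable          # signal -> position fraction
    train: int = 90         # training window, in bars
    test: int = 30          # test window, in bars
    step: int = 7           # roll-forward step, in bars

    def run(self) -> list:
        """Per-window (unannualized) Sharpe ratios over the test windows."""
        sharpes = []
        start = 0
        while start + self.train + self.test <= len(self.returns):
            train_w = self.returns[start:start + self.train]
            test_w = self.returns[start + self.train:start + self.train + self.test]
            pos = self.size(self.signal(train_w))   # held fixed across the window
            pnl = [pos * r for r in test_w]
            sd = pstdev(pnl)
            sharpes.append(mean(pnl) / sd if sd > 0 else 0.0)
            start += self.step
        return sharpes

# Toy usage: a momentum-ish signal on synthetic returns.
import random
random.seed(0)
bt = WalkForwardBacktester(
    returns=[random.gauss(0.0005, 0.01) for _ in range(400)],
    signal=lambda w: mean(w),
    size=lambda s: 1.0 if s > 0 else -1.0,
)
window_sharpes = bt.run()
```

Because the data source, signal, and sizer are all injected, the same harness runs against real market data, synthetic data, or deliberately pathological data without modification.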

I also added property-based testing using Hypothesis. For any mathematical function in the system — Kelly fraction, ATR calculation, position sizing — I define invariants and let Hypothesis search for counterexamples. The Kelly fraction must always return a value between 0 and 1. Position size must always be at least the minimum notional. ATR must always be positive. These properties should hold for any valid input, not just the inputs I thought to test.
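A sketch of what one of those properties looks like in Hypothesis — the ATR implementation here is a standard one, assumed rather than taken from the post. The trick is generating bars as (low, spread, close-position) triples so every generated input is a valid OHLC bar:

```python
from hypothesis import given, strategies as st

def atr(highs, lows, closes, period=14):
    """Average True Range: mean of the true ranges of the last `period` bars."""
    trs = [max(highs[i] - lows[i],
               abs(highs[i] - closes[i - 1]),
               abs(lows[i] - closes[i - 1]))
           for i in range(1, len(closes))]
    window = trs[-period:]
    return sum(window) / len(window)

# Generate bars as (low, spread, close-position) so high >= low always holds.
bar = st.tuples(
    st.floats(min_value=1.0, max_value=1_000.0),   # low
    st.floats(min_value=0.01, max_value=50.0),     # high - low
    st.floats(min_value=0.0, max_value=1.0),       # where close sits in the range
)

@given(st.lists(bar, min_size=2, max_size=60))
def test_atr_is_positive(bars):
    lows = [lo for lo, _, _ in bars]
    highs = [lo + sp for lo, sp, _ in bars]
    closes = [lo + sp * f for lo, sp, f in bars]
    assert atr(highs, lows, closes) > 0.0

test_atr_is_positive()  # Hypothesis generates and checks ~100 random inputs
```

When an invariant fails, Hypothesis shrinks the counterexample to a minimal failing input, which is usually far more readable than the random case that first triggered it.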

What the framework caught

Three strategies that passed standard backtesting failed walk-forward. One showed an in-sample Sharpe of 1.8 and a consistent out-of-sample Sharpe of -0.3. Another was time-of-day dependent in a way I hadn't noticed — it traded well in the training windows, which happened to overlap with higher-volatility hours, and poorly in the test windows that didn't.

The property tests found two edge cases in the Kelly calculator: negative win-loss ratio (possible if configured wrong) and win rate exactly equal to 1.0 (divides by zero in the simplified formula). Both were silent failures before — they returned garbage values. Now they raise early.
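A sketch of what the hardened calculator might look like, using the common simplified form f = p − (1 − p)/b. The post's exact formula isn't shown (the division-by-zero at a win rate of 1.0 evidently lives in its variant), so the guards, not the formula, are the point here:

```python
def kelly_fraction(win_rate: float, win_loss_ratio: float) -> float:
    """Simplified Kelly: f = p - (1 - p) / b, clamped to [0, 1].

    The guards raise on the two edge cases the property tests surfaced,
    instead of silently returning garbage."""
    if win_loss_ratio <= 0.0:
        raise ValueError(f"win/loss ratio must be positive, got {win_loss_ratio}")
    if not 0.0 <= win_rate < 1.0:
        raise ValueError(f"win rate must be in [0, 1), got {win_rate}")
    f = win_rate - (1.0 - win_rate) / win_loss_ratio
    return min(max(f, 0.0), 1.0)
```

Without the first guard, a negative ratio can push the unguarded formula above 1 — a position size larger than the bankroll — which is exactly the kind of garbage value that propagates silently into the sizer.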

The honest thing about this kind of testing is that it mostly produces disappointing results. Most strategies don't hold up. That's the point.