A backtest that passes on bad data isn't a backtest. It's a confidence builder that will cost you money.
I discovered this the hard way. My trading engine was showing edges I couldn't replicate in paper mode. The strategies looked good. The data looked fine. The problem was that I hadn't defined what "fine" meant.
The gap
OHLCV data from crypto APIs is noisy. Exchanges have clock drift. Some candles get dropped and silently reconstructed. Outlier spikes appear in low-liquidity windows at prices that would never fill in practice. None of this shows up as an error. The data just quietly lies to you.
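Dropped-and-reconstructed candles are detectable before you ever look at prices: the bar timestamps should form an unbroken grid. A minimal sketch (the function name and example series are illustrative, not from any particular exchange client):

```python
from datetime import datetime, timedelta, timezone

def find_gaps(timestamps, interval=timedelta(minutes=1)):
    """Return expected bar-open times missing from a sorted candle series.

    Any missing step is a candle the exchange dropped -- and may have
    silently backfilled later with reconstructed values.
    """
    gaps = []
    for prev, curr in zip(timestamps, timestamps[1:]):
        expected = prev + interval
        while expected < curr:
            gaps.append(expected)
            expected += interval
    return gaps

bars = [datetime(2024, 1, 1, 0, m, tzinfo=timezone.utc) for m in (0, 1, 2, 5, 6)]
print(find_gaps(bars))  # minutes 3 and 4 are missing
```

Clock drift is why the grid check should run on exchange-reported bar opens, not on your own receive timestamps.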
I was training signal detectors on this data and getting excited about returns that were partially explained by artifacts. The gap between backtest and paper trading was the gap between the clean story and the real one.
What I built
The data truth gate sits before any strategy sees a candle. It cross-validates prices across sources, runs outlier detection on returns, and compares against web-verified bounds. If a candle doesn't pass, it doesn't enter the pipeline. The system hard-halts when reconciliation fails — it doesn't silently continue with degraded data.
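The shape of such a gate can be sketched in a few lines. Everything here is illustrative — the class name, thresholds, and two-source signature are assumptions, not the actual implementation — but it shows the two properties that matter: validation happens per candle before anything downstream runs, and failure raises rather than degrades.

```python
class DataTruthError(RuntimeError):
    """Raised to hard-halt the pipeline when reconciliation fails."""

def gate_candle(primary_close, secondary_close, prev_close,
                max_source_divergence=0.02, max_abs_return=0.20):
    # Cross-validate: the two sources must agree within tolerance.
    divergence = abs(primary_close - secondary_close) / secondary_close
    if divergence > max_source_divergence:
        raise DataTruthError(f"source divergence {divergence:.2%}")
    # Outlier check: an implausible bar-to-bar return is rejected, not clipped.
    ret = abs(primary_close / prev_close - 1.0)
    if ret > max_abs_return:
        raise DataTruthError(f"implausible return {ret:.2%}")
    return primary_close  # only validated candles enter the pipeline
```

Raising an exception instead of logging and continuing is the whole point: a strategy should never receive a candle the gate couldn't vouch for.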
The invariant that mattered most: each closing price must be within 2% of the reference source's price at the time of the bar. Sounds obvious. Wasn't enforced anywhere.
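Enforced, the invariant is a one-line relative-error check (sketch; the function name is hypothetical):

```python
def close_within_tolerance(stored_close, source_close_at_bar, tolerance=0.02):
    """True iff the stored close is within `tolerance` (2% by default)
    of the reference source's price at the time of the bar."""
    rel_error = abs(stored_close - source_close_at_bar) / source_close_at_bar
    return rel_error <= tolerance
```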
What it caught
Three strategies that looked profitable in backtests were unprofitable when run on validated data. One had a Sharpe of 1.3 on dirty data and 0.2 on clean. That's not a good strategy with some noise — that's a strategy that was entirely explained by the noise.
I also found that two data sources I was treating as equivalent had divergence above 3% on about 8% of candles. Not random divergence — systematic, correlated with low-volume hours. One of them was wrong. I still don't know which one.
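A divergence audit like this is easy to run yourself. A minimal sketch, assuming aligned close series from the two sources plus per-candle volume (names and data are illustrative); it reports the overall flag rate and the rate within the low-volume half, which is where the correlation showed up:

```python
def divergence_report(closes_a, closes_b, volumes, threshold=0.03):
    """Fraction of candles where two sources' closes diverge by more than
    `threshold`, overall and within the below-median-volume half."""
    flagged = [abs(a - b) / b > threshold for a, b in zip(closes_a, closes_b)]
    rate = sum(flagged) / len(flagged)
    median_vol = sorted(volumes)[len(volumes) // 2]
    low_vol = [f for f, v in zip(flagged, volumes) if v < median_vol]
    low_vol_rate = sum(low_vol) / max(1, len(low_vol))
    return rate, low_vol_rate
```

If `low_vol_rate` is far above `rate`, the divergence is systematic, not random — which still leaves open which source is wrong.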
The gate is never turned off. If the data can't be verified, nothing trades.