Theory says a reasoning layer should add value over a signal-only system. I wanted to know if it actually did.
Sniper is a pure Python strategy. Same signal filters as Archer, same risk guards, but no LLM in the loop. It takes any signal that clears the threshold. Archer takes the same signals, but Claude first decides whether to act on them.
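A minimal sketch of the two paths, with hypothetical names for the signal object, thresholds, and the Claude call (the real filters and risk guards are more involved than this):

```python
from dataclasses import dataclass
from typing import Callable

# Both bots started with the same pre-filter threshold (Archer's is raised later).
SNIPER_THRESHOLD = 0.45
ARCHER_THRESHOLD = 0.45

@dataclass
class Signal:
    symbol: str
    score: float   # output of the shared signal filters
    context: str   # market context handed to the reasoning layer

def sniper_decide(sig: Signal) -> bool:
    # Sniper: pure Python, takes every signal that clears its threshold.
    return sig.score >= SNIPER_THRESHOLD

def archer_decide(sig: Signal, ask_claude: Callable[[Signal], dict]) -> bool:
    # Archer: same pre-filter, then Claude decides whether to act.
    if sig.score < ARCHER_THRESHOLD:
        return False
    # ask_claude is a stand-in for the real prompt/response plumbing;
    # assume it returns something like {"action": "TRADE" or "SKIP", "reason": "..."}.
    verdict = ask_claude(sig)
    return verdict["action"] == "TRADE"
```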
I ran them side by side, each posting its reasoning and results to its own Discord channel, with capital split equally between them.
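The wiring for that is simple. A sketch of the routing, assuming each bot posts through its own Discord webhook (the webhook URLs and the `post_decision` helper are placeholders, not the real setup):

```python
import requests

# Placeholder webhook URLs; each bot gets its own channel.
WEBHOOKS = {
    "sniper": "https://discord.com/api/webhooks/.../sniper",
    "archer": "https://discord.com/api/webhooks/.../archer",
}

def post_decision(bot: str, symbol: str, action: str, reason: str) -> None:
    # Post one decision and its reasoning to that bot's channel.
    content = f"**{bot.upper()}** {action} {symbol}\n{reason}"
    requests.post(WEBHOOKS[bot], json={"content": content}, timeout=10)
```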
What the A/B test showed
Archer skipped significantly more signals than Sniper. Some of those skips were right — the trades Sniper took that Archer passed on often lost. Some were wrong — Archer skipped signals that turned out well, citing regime uncertainty that didn't materialise.
The win rate on trades Archer took was higher than Sniper's overall win rate. But Archer's trade frequency was lower, so total P&L was harder to compare directly over a short window.
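Roughly, total P&L is trade count times per-trade expectancy, so a higher hit rate on fewer trades doesn't automatically win; it depends on how many trades each bot takes and how big the wins and losses are. A toy illustration with made-up numbers (none of these are measured figures from the run):

```python
def expected_pnl(n_trades: int, win_rate: float, avg_win: float, avg_loss: float) -> float:
    # Per-trade expectancy times trade count; avg_loss is a positive magnitude.
    return n_trades * (win_rate * avg_win - (1 - win_rate) * avg_loss)

# Illustrative only: with symmetric 1R payoffs, 100 trades at a 52% hit rate
# and 40 trades at 55% produce the same total, despite the better win rate.
print(expected_pnl(100, 0.52, 1.0, 1.0))  # 4.0
print(expected_pnl(40, 0.55, 1.0, 1.0))   # 4.0
```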
The more interesting finding: Archer's SKIP reasoning was often more accurate than its TRADE reasoning. When it passed on a signal, it could usually articulate why in terms that made sense after the fact. When it took a trade, the reasoning was sometimes post-hoc rationalisation of a signal that had only barely cleared the threshold.
What I changed
I raised Archer's signal threshold from 0.45 to 0.55 based on this. If Claude's value is in knowing when not to trade, the pre-filter should do more of the heavy lifting so Claude is deciding on higher-quality candidates.
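In code terms it's a one-line change, using the constant names from the sketch above rather than the real config: Sniper keeps taking everything above its original threshold, while Archer's pre-filter tightens so Claude only sees the stronger candidates.

```python
# Sniper unchanged; Archer's pre-filter raised so marginal signals
# never reach the reasoning layer.
SNIPER_THRESHOLD = 0.45
ARCHER_THRESHOLD = 0.55
```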
Sniper is still running. I use it as the baseline. Archer needs to beat it over a hundred trades to justify the latency and token cost. Still accumulating the sample.
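The comparison I plan to run once the sample exists looks roughly like this; the per-trade token cost default is a placeholder, not a measured figure.

```python
def net_pnl(pnl: float, n_trades: int, token_cost_per_trade: float) -> float:
    # Archer's P&L only counts after subtracting what the reasoning layer costs to run.
    return pnl - n_trades * token_cost_per_trade

def archer_justified(archer_pnl: float, sniper_pnl: float,
                     n_archer_trades: int, token_cost_per_trade: float = 0.02) -> bool:
    # The bar: over ~100 trades, Archer net of token cost must beat Sniper outright.
    return net_pnl(archer_pnl, n_archer_trades, token_cost_per_trade) > sniper_pnl
```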