The four bugs that nearly broke production

Bugs in a live trading system don't announce themselves. They show up as wrong numbers and silent failures at 3 a.m.

I had four of them in one week.

Bug 1: The rate limit storm

My trading engine was polling the Hyperliquid API for order status more aggressively than I realised, and the rate limiter kicked in. Instead of backing off, the retry logic retried immediately. Each retry triggered another rate limit error, which triggered another retry. The feedback loop ran for 20 minutes before the process died.

The fix was exponential backoff with jitter: double the wait after each failed attempt, up to a cap, and randomise it. It is the standard solution, and I hadn't implemented it correctly. The jitter matters: without it, multiple processes retry at the same intervals and storm the endpoint together.
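A minimal sketch of that retry policy, using the "full jitter" variant where each delay is drawn uniformly between zero and the current exponential ceiling. The names here (RateLimitError, call_with_backoff, the base and cap values) are illustrative assumptions, not the engine's actual code or anything from the Hyperliquid SDK.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical error raised when the API reports a rate limit."""


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random):
    """Full-jitter delay: uniform in [0, min(cap, base * 2**attempt)]."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


def call_with_backoff(fn, max_attempts=6, base=0.5, cap=30.0, sleep=time.sleep):
    """Call fn(), retrying on RateLimitError with jittered exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # give up instead of looping forever
            sleep(backoff_delay(attempt, base, cap))
```

Two properties do the work here: the attempt count is bounded, so a rate limit can never turn into a 20-minute loop, and the random delay spreads retries from separate processes apart instead of letting them land on the endpoint at the same instant.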

Bug 2: The orphaned position

When the reconciliation function ran after a restart, it repopulated the in-memory trader dictionary from the exchange's reported positions. But one trader instance had been partially initialised before the crash: its position appeared in the exchange's data, but nothing in my system's mapping owned it. The position sat open with no stop-loss attached.

The fix was making reconciliation authoritative: the exchange state overrides local state, always. If a position exists on the exchange, a trader must own it. If no trader owns it, create one. Never assume local and remote are consistent.
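The rule can be sketched roughly as follows. This assumes positions arrive as a simple symbol-to-size mapping and uses a hypothetical Trader class; the real engine's types and the exchange's response format will differ.

```python
from dataclasses import dataclass


@dataclass
class Trader:
    """Hypothetical stand-in for the engine's per-position trader object."""
    symbol: str
    size: float
    has_stop_loss: bool = False


def reconcile(local_traders, exchange_positions):
    """Rebuild the trader mapping so it matches the exchange exactly.

    local_traders: dict of symbol -> Trader (possibly stale or incomplete)
    exchange_positions: dict of symbol -> position size as reported remotely
    """
    reconciled = {}
    for symbol, size in exchange_positions.items():
        if size == 0:
            continue
        trader = local_traders.get(symbol)
        if trader is None:
            # Orphaned position: adopt it rather than leave it unmanaged.
            trader = Trader(symbol=symbol, size=size)
        trader.size = size  # remote size wins over whatever was cached
        reconciled[symbol] = trader
    # Traders with no exchange position are dropped by omission.
    return reconciled


def needs_stop_loss(traders):
    """Symbols whose trader has no stop-loss attached yet."""
    return sorted(s for s, t in traders.items() if not t.has_stop_loss)
```

Running needs_stop_loss on the result right after a restart gives a list of adopted positions that still need protection, which is exactly the case that went unnoticed here.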

Bug 3: The state that didn't persist

Win records were being updated in memory but not written to disk. Every restart started with a blank win history. The circuit breaker was calculating thresholds against a zeroed win rate instead of the actual historical rate. My strategy had a 60% win rate. The circuit breaker was treating it as 0% and applying a very aggressive halt threshold.

The asymmetric bug: losses were being persisted (they triggered the state write). Wins weren't. It had been running like this for two weeks.
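One way to rule out this kind of asymmetry is to route every outcome through a single function that always writes. The sketch below assumes a small JSON state file and hypothetical function names; it is not the engine's actual persistence code. The write goes to a temporary file first and is then renamed into place, so a crash mid-write cannot leave a half-written state file.

```python
import json
import os
import tempfile


def save_state(path, state):
    """Write state atomically: temp file in the same directory, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_state(path):
    """Load saved counts, or start from zero if nothing was ever saved."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"wins": 0, "losses": 0}


def record_trade(path, won):
    """Update the counts and persist them, whatever the outcome was."""
    state = load_state(path)
    state["wins" if won else "losses"] += 1
    save_state(path, state)  # single write path for both outcomes
    return state


def win_rate(state):
    """Return the win rate, or None when there is no history yet."""
    total = state["wins"] + state["losses"]
    return state["wins"] / total if total else None
```

Returning None for an empty history is a design choice in this sketch: it lets a circuit breaker tell "no data yet" apart from a genuine 0% win rate, instead of silently treating the two as the same thing.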

Bug 4: The AVAX dust position

An AVAX position fell below minimum notional after a partial close. The close loop tried to close it. The exchange rejected the order (too small). The loop retried. The exchange rejected again. Infinite loop. The watcher that should have caught this was checking whether Kaleo (the orchestration service) was available before attempting the emergency close. If Kaleo was down, the watcher did nothing.

The fix: emergency close must be independent of all other services. A position that needs to close doesn't care whether the orchestration layer is healthy.
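A rough sketch of a close routine with both properties: it takes no dependency on the orchestration layer, and it treats a below-minimum position as a terminal state rather than something to retry. Everything named here (submit_close, OrderTooSmall, the min_notional parameter) is a hypothetical placeholder rather than a real exchange API.

```python
from enum import Enum


class CloseResult(Enum):
    CLOSED = "closed"
    DUST = "dust"      # below minimum notional; cannot be closed by order
    FAILED = "failed"  # retries exhausted


class OrderTooSmall(Exception):
    """Hypothetical error for an exchange rejection on minimum notional."""


def emergency_close(symbol, size, price, min_notional, submit_close, max_attempts=3):
    """Try to flatten a position without consulting any other service.

    submit_close(symbol, size) is a hypothetical callable that talks to the
    exchange directly and raises on failure.
    """
    if abs(size) * price < min_notional:
        return CloseResult.DUST  # retrying can never succeed, so do not loop
    for _ in range(max_attempts):
        try:
            submit_close(symbol, size)
            return CloseResult.CLOSED
        except OrderTooSmall:
            return CloseResult.DUST  # terminal rejection, not a transient one
        except Exception:
            continue  # transient failure: retry, bounded by max_attempts
    return CloseResult.FAILED
```

The caller decides what to do with DUST and FAILED (log it, alert, leave the dust alone), but in neither case does the loop spin. A real version would also pause between attempts, for example with a jittered backoff.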

All four bugs had one thing in common: they only manifested under conditions that didn't occur in paper trading. Live money reveals a different class of problem.