There's a specific kind of frustration when the tool you're using to debug your code is itself broken.

Claude Code was crashing. Not on every session, not predictably, but enough that I couldn't trust it. It would exit mid-task with no useful error. I'd restart it — I had a cc() alias in my .zshrc that auto-restarted on failure — and that was the first problem. The auto-restart was hiding the exit event. I couldn't see whether it was crashing, being cancelled, or just stopping normally. All I knew was: it wasn't finishing tasks.
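The alias itself is gone now, but the failure-masking shape is worth showing. This is a hypothetical reconstruction, not the original .zshrc contents, next to a replacement that surfaces the exit status instead of eating it:

```shell
# Hypothetical reconstruction of the problem -- not the original .zshrc.
# The restart loop swallows the exit status, so a crash is
# indistinguishable from a clean exit.
cc_old() {
  while true; do
    claude "$@" && break   # restart on any non-zero exit, discarding the code
  done
}

# Replacement: run once and make the exit status loud.
cc() {
  claude "$@"
  local rc=$?
  if [ "$rc" -ne 0 ]; then
    echo "claude exited with status $rc" >&2
  fi
  return $rc
}
```

With the loop gone, a crash, a cancellation, and a normal exit each leave a visible, distinct trace.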

The layers

By the time I wrote the crash test matrix, the evidence pointed at more than one issue: hooks misbehaving, MCP servers churning on startup, and possibly the core runtime itself.

Three distinct failure modes. Possibly active simultaneously. The naive approach would be to fix them one by one and hope for the best. The problem is that fixing one while two others are active means you can't tell if your fix did anything.

The matrix

I built a bash script — tools/claude_failure_matrix.sh — that runs four test cases with clear isolation boundaries:

full — current behaviour. Hooks enabled, MCPs enabled, normal project state. This establishes whether the failure is still reproducible.

no-hooks — same as full, but both ~/.claude/hooks.json and ./.claude/hooks.json are temporarily moved aside before the run. If full fails but no-hooks passes, hooks are the cause.

no-mcp — hooks on, but Claude runs with --strict-mcp-config --mcp-config ~/.claude/empty-mcp.json. If full fails but no-mcp passes, the MCP startup churn is causal.
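The empty MCP config is just a server definition file with no servers in it. Assuming the standard mcpServers shape for Claude Code --mcp-config files, creating it looks like:

```shell
# An MCP config that defines zero servers, so Claude starts no MCP
# processes at all. The mcpServers key is the standard config shape.
mkdir -p ~/.claude
cat > ~/.claude/empty-mcp.json <<'EOF'
{
  "mcpServers": {}
}
EOF
```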

minimal — hooks off, MCPs off, --setting-sources local, slash commands disabled. The smallest possible Claude surface. If this fails too, the problem is inside the core runtime or the terminal PTY path, and this is the clean reproduction case you'd send to Anthropic.

Each case stores the exact command used, a config snapshot before the run, full claude-diag stdout/stderr, and a summary — all under ~/.claude/diagnostics/matrix-<timestamp>/. The script restores hooks on exit via a trap so the backup suffix files don't linger.
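The real script is longer, but the isolation pattern condenses to a small skeleton. This is a sketch, not the actual tools/claude_failure_matrix.sh: the run_case internals are simplified and the minimal case's slash-command handling is omitted.

```shell
#!/usr/bin/env bash
# Sketch of the isolation pattern in tools/claude_failure_matrix.sh.
STAMP=$(date +%Y%m%d-%H%M%S)
OUT="$HOME/.claude/diagnostics/matrix-$STAMP"

hooks_aside() {   # move global and project hooks out of Claude's view
  for f in "$HOME/.claude/hooks.json" ./.claude/hooks.json; do
    if [ -e "$f" ]; then mv "$f" "$f.matrix-bak"; fi
  done
}
restore_hooks() {
  for f in "$HOME/.claude/hooks.json" ./.claude/hooks.json; do
    if [ -e "$f.matrix-bak" ]; then mv "$f.matrix-bak" "$f"; fi
  done
}
trap restore_hooks EXIT   # hooks come back even if a case kills the script

run_case() {   # run_case <name> [claude flags...]
  local name=$1; shift
  local dir="$OUT/$name"; mkdir -p "$dir"
  printf 'claude %s\n' "$*" > "$dir/command.txt"        # exact command used
  cp "$HOME/.claude.json" "$dir/config-snapshot.json" 2>/dev/null || true
  local rc=0
  claude "$@" > "$dir/stdout.log" 2> "$dir/stderr.log" || rc=$?
  echo "exit=$rc" > "$dir/summary.txt"
}

run_case full
hooks_aside
run_case no-hooks
restore_hooks
run_case no-mcp --strict-mcp-config --mcp-config "$HOME/.claude/empty-mcp.json"
hooks_aside
run_case minimal --strict-mcp-config --mcp-config "$HOME/.claude/empty-mcp.json" \
                 --setting-sources local
restore_hooks
```

The trap matters more than it looks: a crashing case must not leave your real hooks renamed to .matrix-bak, or the debugging session damages the config it was meant to diagnose.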

The decision table

What made the matrix useful wasn't just running the cases — it was knowing what each outcome meant before running:

Only full fails → current config/state issue. Look at ~/.claude.json, active MCP list, active hooks.

full and no-hooks fail, no-mcp passes → MCP problem. Focus on google and playwright.

full fails, no-hooks passes → hook problem. Check convergence-check.sh and any hook returning non-JSON.

Everything fails including minimal → core runtime or PTY. You now have a clean repro for Anthropic: no hooks, no MCP, local settings only.
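Encoded as shell, the table collapses to a few branches. This verdict helper is my illustration, not part of the original script; each argument is pass or fail for the corresponding case:

```shell
# Hypothetical helper mirroring the decision table: map the four case
# outcomes to a verdict. verdict <full> <no-hooks> <no-mcp> <minimal>
verdict() {
  local full=$1 nohooks=$2 nomcp=$3 minimal=$4
  if [ "$minimal" = fail ]; then
    echo "core runtime or PTY"                 # clean repro for Anthropic
  elif [ "$full" = fail ] && [ "$nohooks" = fail ] && [ "$nomcp" = pass ]; then
    echo "MCP problem"                         # hooks off didn't help; MCP off did
  elif [ "$full" = fail ] && [ "$nohooks" = pass ] && [ "$nomcp" = fail ]; then
    echo "hook problem"                        # fails whenever hooks are enabled
  elif [ "$full" = fail ]; then
    echo "current config/state issue"          # only the full combination fails
  else
    echo "no reproduction"
  fi
}
```

Writing this down before running the cases is the point: each outcome already has one interpretation, so the results can't be rationalized after the fact.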

What I found

The convergence-check.sh hook was one of the culprits. It was injecting trading metrics into every prompt path — including sessions where I was doing something completely unrelated. When the metrics script hit an error state, it wasn't returning valid JSON. Claude's hooks contract is strict: malformed hook output aborts the session.
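The fix itself was mundane, but the defensive pattern is general. A hypothetical wrapper (safe_hook is my name, and treating an empty JSON object as an accepted no-op under the hooks contract is my assumption) that guarantees the hook never emits malformed output:

```shell
# Hypothetical wrapper: run the real hook command; if its output is not
# valid JSON, substitute an empty object rather than abort the session.
# Assumption: {} is accepted as a no-op by Claude's hooks contract.
safe_hook() {
  local out
  out=$("$@" 2>/dev/null) || true
  if printf '%s' "$out" | python3 -c 'import json,sys; json.load(sys.stdin)' 2>/dev/null; then
    printf '%s\n' "$out"
  else
    echo '{}'
  fi
}

# Usage in the hook entry: safe_hook ./convergence-check.sh
```

The trade-off is explicit: a broken metrics script now degrades to a silent no-op instead of taking the whole session down with it.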

The MCP churn was real but secondary. The google and playwright servers were reconnecting during runs due to idle timeouts — annoying but not fatal once the hook issue was fixed.

The actual lesson

I'd been blaming Claude for being unreliable. The real problem was that I'd built a lot of automation around it — hooks, MCPs, wrapper scripts, auto-restarts — and then when something broke I had no way to tell which layer was responsible.

The matrix took about two hours to build and write properly. It saved me days of guessing. Every time the debugging target is a tool you depend on, the first investment should be isolation infrastructure, not random fixes.

The cc() auto-restart alias is gone. If Claude exits, I see it exit.