The problem with context windows is that they have edges. Everything before the edge is gone.
My Discord bot (OpenClaw, running on Hostinger) was useful in conversation but had no continuity. Every session started fresh. It didn't know what we talked about yesterday. It didn't know the decisions I'd made last week or why I made them.
I wanted it to remember. Not everything — just what mattered.
The architecture
The RAG system runs as a FastAPI server on port 8200. It has three pieces: an indexer that processes text files and code into chunks, a ChromaDB vector store, and a /query endpoint that returns ranked results for any input.
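The query path is simple in outline: embed the input, score it against the stored chunk vectors, return the top matches. A minimal sketch of that flow — the in-memory store and the toy character-frequency embed function are stand-ins for ChromaDB and the real model, not the actual implementation:

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def query(text, store, embed, top_k=3):
    # What the /query endpoint does in outline: embed the input,
    # rank every stored chunk by similarity, return the best matches.
    qvec = embed(text)
    scored = [(cosine(qvec, vec), chunk) for chunk, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for score, chunk in scored[:top_k]]

# Toy embed function: a character-frequency vector. A real embedding
# model replaces this; the ranking logic stays the same.
def toy_embed(text):
    return [text.lower().count(c) for c in "abcdefghijklmnopqrstuvwxyz"]

store = [(c, toy_embed(c)) for c in ["circuit breaker halts on errors",
                                     "deploy notes for hostinger",
                                     "discord bot setup"]]
print(query("when does the breaker halt?", store, toy_embed, top_k=1))
```

The endpoint itself is just this function behind a FastAPI route; the interesting decisions all live in what gets indexed and how.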
The chunker is deterministic — same file, same chunks, every time. This matters for the upsert logic: if a file changes, the new chunks replace the old ones without creating duplicates. I spent two days on this because naive RAG implementations re-index everything on every run, so outdated versions of documents pile up alongside current ones and poison the results.
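The core of that upsert logic can be sketched with stable chunk IDs: derive each ID from the source path plus the chunk's position, so re-indexing a changed file overwrites its old chunks instead of accumulating duplicates. The fixed-size splitter and dict-backed store below are simplifications for illustration, not the real chunker:

```python
import hashlib

def chunk_file(path, text, size=500):
    # Deterministic chunking: the same input text always yields the
    # same chunks in the same order, so chunk IDs are stable across runs.
    chunks = []
    for i in range(0, len(text), size):
        body = text[i:i + size]
        # The ID depends only on the source path and chunk position,
        # never on run-time state, so re-indexing is idempotent.
        cid = hashlib.sha256(f"{path}:{i}".encode()).hexdigest()[:16]
        chunks.append((cid, body))
    return chunks

def upsert_file(store, path, text):
    # Drop every chunk previously indexed for this path, then insert
    # the fresh ones: an update replaces rather than duplicates.
    store = {cid: c for cid, c in store.items() if c["path"] != path}
    for cid, body in chunk_file(path, text):
        store[cid] = {"path": path, "body": body}
    return store

store = {}
store = upsert_file(store, "notes.md", "v1 " * 300)    # first index
store = upsert_file(store, "notes.md", "v2 " * 300)    # file changed
assert all("v2" in c["body"] for c in store.values())  # no stale v1 chunks
```

Deleting by path before inserting also handles files that shrink: chunks past the new end of the file disappear instead of lingering with stale content.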
The embed function is injected, so I can swap models without touching the core logic. In practice it uses a local embedding model through Ollama. Fast enough, free to run.
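Injection here just means the store takes the embedding function as a constructor argument instead of importing a specific model. A sketch of that shape — the class name and the Ollama wiring mentioned in the comment are assumptions for illustration, not the real code:

```python
from typing import Callable, List

class VectorStore:
    def __init__(self, embed: Callable[[str], List[float]]):
        # The embedding model is injected: the core logic never knows
        # whether vectors come from Ollama, a cloud API, or a test stub.
        self.embed = embed
        self.rows = []  # (vector, text) pairs; ChromaDB in the real system

    def add(self, text: str):
        self.rows.append((self.embed(text), text))

# Production would wire in a function that calls a local model through
# Ollama; a test injects a cheap stub instead, and nothing else changes.
stub_embed = lambda text: [float(len(text))]
store = VectorStore(embed=stub_embed)
store.add("hello")
assert store.rows[0] == ([5.0], "hello")
```

Swapping models is then a one-line change at the call site, which is the whole point of the injection.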
What it changed
OpenClaw now retrieves relevant context before responding — architecture notes, decisions, code patterns, whatever got indexed. When I ask it about a system I built three weeks ago, it finds the design document and the relevant code chunks rather than confabulating from training data.
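The retrieval step slots in before generation: hit /query with the user's message, prepend whatever comes back, then call the model. A sketch of that wiring — the prompt format and function name here are illustrative, not the bot's actual code:

```python
def build_prompt(user_message, retrieved_chunks, max_chunks=4):
    # Prepend retrieved context so the model answers from indexed
    # notes and code instead of confabulating from training data.
    context = "\n---\n".join(retrieved_chunks[:max_chunks])
    return (
        "Use the following retrieved context when relevant.\n\n"
        f"{context}\n\n"
        f"User: {user_message}"
    )

chunks = ["design doc: circuit breaker halts after 3 failures",
          "test: breaker reopens after cooldown"]
prompt = build_prompt("How does the circuit breaker work?", chunks)
assert "circuit breaker halts" in prompt
```

Capping the chunk count matters: retrieval returns ranked results, and past the first few the marginal chunk is more likely to distract than help.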
It also means I have a search system over my own codebase that understands intent, not just keywords. "How does the circuit breaker decide when to halt?" retrieves the relevant file, the tests, and the design notes — not just every file that contains the word "halt".
The whole system is about 400 lines. The hard parts were the chunking strategy and the upsert logic. Everything else was plumbing.