
I built a 12-model AI system because one model wasn't enough

What multiagent AI is in practice, how I set it up as a solo freelancer, and what failed along the way.

Author: Damian Wojcik

Multiagent network: Opus at the center, connected to GPT-5.x variants, Haiku subagents, and Kimi K2.5

I have twelve AI models running in parallel. I’m one person. Cost: Claude Pro, ChatGPT Plus, and maybe €5–10 extra in Claude credits when I hit the limit mid-task. That’s it. ChatGPT Plus at this level of use — models as external nodes, not a daily chatbot — is actually hard to burn through. Claude tokens go faster because it’s the brain of the operation. Setup took a weekend.

Now, why do this at all. I use AI constantly — for my own projects and for client work. And there’s one thing that rarely gets said out loud: one model means one perspective. One set of blind spots. When I rely solely on a single model’s answer, I have no way to check whether it’s a good answer or just a confident-sounding one.

Multiagent AI is exactly what it sounds like. Instead of one model, you build a system where several work together — some analyze, some handle the operational work, some vote on answers and assemble them into a final recommendation. The quality difference is real when three models independently reach the same conclusion versus one model making a best guess.

For me this matters most during research and when entering new territory. I can hand clients conclusions I can actually stand behind — not because one model said so, but because three independent sources said the same thing.

What I actually built

The brain is Claude Opus 4.6, running through Claude Code in my terminal. This is HAL, my main AI assistant. HAL handles reasoning, decisions, writing, code review, basically anything that needs real thinking. But Opus tokens aren’t cheap, so I don’t spend them on searching through files, running shell commands, or fetching web pages. That’s where subagents come in.

Built into Claude Code, I have Haiku-powered subagents: small, fast, low-cost workers that handle specific jobs. A Bash agent runs terminal commands. An Explore agent searches through codebases. A Plan agent does architecture work with a strict 300-token cap so it stays focused. Opus thinks, Haiku executes. My token bill dropped once I started delegating properly.

Then there’s HydraMCP. This is what makes it a real multiagent system. HydraMCP is a local MCP server that connects to external models. I have 10 GPT-5.x variants through CLIProxyAPI, plus Kimi K2.5 through Nvidia NIM. Twelve models total when you count Claude.

HydraMCP gives me four tools:

  • ask_model — ping a single model with a question, get a response. Good for quick sanity checks.
  • compare_models — send the same prompt to multiple models in parallel. See how each answered, with latency and token metrics side by side.
  • consensus — ask three or more models the same question and check if they agree. Strategies: majority, supermajority, or unanimous.
  • synthesize — take the best parts of multiple responses and combine them into one answer.
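To make the consensus strategies concrete, here is a minimal sketch of how majority, supermajority, and unanimous voting could work. The function, the thresholds, and the string normalization are my own illustration, not HydraMCP's actual implementation:

```python
from collections import Counter

def consensus(answers: list[str], strategy: str = "majority") -> tuple[bool, str]:
    """Check whether a list of model answers agrees under a given strategy."""
    # Normalize so "Yes" and "yes" count as the same answer
    counts = Counter(a.strip().lower() for a in answers)
    top_answer, top_votes = counts.most_common(1)[0]
    thresholds = {
        "majority": len(answers) // 2 + 1,           # more than half
        "supermajority": -(-2 * len(answers) // 3),  # at least two thirds, rounded up
        "unanimous": len(answers),                   # everyone agrees
    }
    return top_votes >= thresholds[strategy], top_answer

# Three models answering the same question
agreed, answer = consensus(["Yes", "yes", "No"], strategy="majority")
```

In a real setup the normalization step matters more than the vote counting: two models rarely produce byte-identical answers, so you would compare extracted conclusions, not raw text.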

The workflow in practice: a client asks a product strategy question. I tell HAL to run consensus across three GPT-5.x variants. If they all agree, I have high confidence. If they disagree, I run compare_models to see where the split is. Then I either synthesize the best parts or use my own judgment. The AI gives me better inputs to decide with. It doesn’t decide for me.
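That decision path fits in a few lines. The function below is my own sketch of it, with a stub standing in for the real HydraMCP tool calls, and the model names taken from the setup described above:

```python
# Sketch of the escalation path: consensus first, surface the split on disagreement.
# `ask` is a placeholder for the real API call, not HydraMCP's actual interface.
def answer_with_confidence(question, models, ask):
    answers = {m: ask(m, question) for m in models}
    if len(set(answers.values())) == 1:
        # All models agree: high confidence, take the answer
        return {"confidence": "high", "answer": next(iter(answers.values()))}
    # Disagreement: return the full split so a human can compare and decide
    return {"confidence": "split", "answers": answers}

models = ["gpt-5.1", "gpt-5.1-codex", "gpt-5-codex"]
stub = lambda model, q: "tiered pricing"  # stand-in for a real API call
result = answer_with_confidence("Which pricing model fits this client?", models, stub)
```

The important design choice is the second branch: on a split, the system hands back the disagreement instead of forcing a merge, which is exactly where human judgment takes over.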

Lately, during research, I let HAL work through a problem on its own, present its conclusions, and if I think it’s right and the sources back it up — I accept the findings and drop them straight into the project documents.

What actually works

February 7, 2026. I ran a full stress test. Not a controlled benchmark, a real session where I threw actual tasks at it and watched what happened.

Stress test results: model response times and consensus

Response times were solid, according to HAL. gpt-5.1-codex came back in 960ms. gpt-5-codex hit 700ms on code-heavy questions. Running compare_models across four GPT-5.x variants in parallel worked without a hitch. All four responses came back, formatted, side by side, ready to compare. The consensus tool hit unanimous agreement, 3 out of 3, on the test questions.

The real win was stability. The system ran for 90 minutes autonomously. No human intervention, no crashes, no weird hangs. HAL delegated to subagents, subagents did their work, HydraMCP called external models, results flowed back up. I got 90 minutes back while making dinner.

What absolutely doesn’t work

Now for the part most AI blog posts skip.

Kimi K2.5 stopped responding. Twice. I asked it to write a draft of this very blog post. First attempt: sent the prompt, waited over two minutes, got back an empty response. Literally null content. Tried again. Same result. I benched it and moved on because I had work to deliver.

Later I dug into what went wrong. Kimi K2.5 is a reasoning model. Before it writes its actual response, it does internal “thinking,” a chain-of-thought process that burns through tokens. When your max_tokens setting is too low, the model can spend everything on thinking and leave nothing for the actual output. From my end, it looked like the model waited and returned nothing. From Kimi’s end, it was doing work; it just had no budget left to produce an answer.

The fix had two parts: enforce a minimum of 512 tokens for any reasoning model call, and fall back to reading the reasoning_content field if the main content comes back empty. That way, even if the model burns all its tokens on thinking, I can still see what it was reasoning about. This matters especially when NVIDIA’s servers are under load — developers worldwide are burning through their free token quotas, and response times can stretch significantly.
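Both parts of the fix reduce to two small functions. This is a sketch assuming an OpenAI-style response shape where reasoning models expose a reasoning_content field next to content; the constant and function names are mine:

```python
MIN_REASONING_TOKENS = 512

def effective_max_tokens(requested: int) -> int:
    """Fix 1: enforce a floor so the model has budget left after 'thinking'."""
    return max(requested, MIN_REASONING_TOKENS)

def extract_answer(message: dict) -> str:
    """Fix 2: fall back to the reasoning trace when content comes back empty."""
    return message.get("content") or message.get("reasoning_content") or ""
```

With these in place, a too-small token budget gets silently corrected before the call, and a burned-out response still yields the reasoning trace instead of null content.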

The lesson: if you’re building a multiagent system, you need fallbacks. Real ones. In practice, that means a plan B that kicks in when an agent returns empty output, doesn’t respond, or hits a rate limit. For me, the first line of defense is reading reasoning_content when content comes back empty. Other times I route to a different model in the pool, one that’s been more reliable lately. When a response gets cut off, I bump max_tokens and retry. Not every model behaves the way the docs suggest. Kimi K2.5 is a good model when it works. But “when it works” is doing a lot of heavy lifting in that sentence.
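The routing part of that plan B is simple to sketch. This is a minimal illustration, not production code; ask stands in for the real API call:

```python
# Minimal plan-B routing: try models in preference order, skip failures.
def ask_with_fallback(prompt, models, ask):
    for model in models:
        try:
            answer = ask(model, prompt)
        except Exception:
            continue  # timeout or rate limit: move to the next model in the pool
        if answer:  # empty output also triggers the fallback
            return model, answer
    raise RuntimeError("every model in the pool failed")
```

A refinement worth adding in practice: log which model actually answered, so you notice when your first choice has been silently benched for a week.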

Update (Feb 15): After applying both fixes, Kimi K2.5 has been running stable for over a week. It’s slow — 45 seconds per response is typical for a reasoning model — but it reliably returns results now. I use it for tasks where I want a genuinely different perspective from the GPT family. The fix was worth it.

How to build your own

Start with two models, not twelve. Get one strong reasoning model as your main brain. Claude Opus, GPT-5, whatever you prefer. Then add one cheap, fast model for routine work. Get that working reliably before you add more. The complexity of a multiagent system increases with every model. Each addition brings new failure modes, API quirks, and billing. I started with two and added more only when I had a specific reason.

And here’s what HAL has to say about running the thing day-to-day:

Health-check everything. Before you send a real task to a model, ping it with something trivial. “What’s 2+2?” If it can’t answer that in under five seconds, don’t send it your important prompt. I run health checks at the start of every session now. It takes about thirty seconds and has saved me from sending work to models that were down or rate-limited. Log every request, every response time, every failure. When something breaks at 2 AM, logs are the difference between fixing it quickly and debugging blind.
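A session-start health check is a few lines of code. This is my own sketch of the idea, with ask as a placeholder for the real API call:

```python
import time

def health_check(models, ask, timeout=5.0):
    """Ping each model with a trivial prompt before sending real work."""
    healthy = []
    for model in models:
        start = time.monotonic()
        try:
            answer = ask(model, "What's 2+2?")
        except Exception:
            continue  # down or rate-limited: bench it for this session
        if answer and time.monotonic() - start < timeout:
            healthy.append(model)
    return healthy
```

Run it once at session start and route the day's work only to the models that made the list.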

Know when to stop asking models and start asking humans. Consensus is useful when three models independently agree. But when they disagree, and you run it again, and they still disagree, that’s usually not a technical problem. It’s a hard question. Last week I ran consensus on a pricing strategy and got a 2-1 split. Ran it again, same split. So I took the two positions, summarized them in three paragraphs, and sent both to my client with my own recommendation. That was more useful than running it a third time hoping for convergence.

Keep your costs visible. I track token usage per model, per session. When Opus costs start creeping up, I look at what tasks could be delegated to Haiku. When a model consistently gives mediocre results, I bench it rather than paying for responses I won’t use. Treat your model roster like a working team. Each model needs to justify its place.
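Per-model tracking doesn't need more than a small ledger. A minimal sketch, with made-up placeholder prices rather than any provider's real rates:

```python
from collections import defaultdict

class TokenLedger:
    """Minimal per-model token tracker; prices here are hypothetical."""
    def __init__(self, price_per_1k):
        self.price_per_1k = price_per_1k
        self.tokens = defaultdict(int)

    def record(self, model, tokens):
        self.tokens[model] += tokens

    def cost(self, model):
        return self.tokens[model] / 1000 * self.price_per_1k[model]

ledger = TokenLedger({"opus": 0.015, "haiku": 0.001})  # placeholder prices
ledger.record("opus", 12_000)
ledger.record("haiku", 50_000)
```

Reviewing the ledger at the end of each session is what surfaces the "this task should have gone to Haiku" moments.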

It’s plumbing, with error handling that you have to take seriously. Watching four models respond in parallel and converge on the same answer in under a second makes you forget the weekend you spent wiring it together.

Got your own AI experiments going, or still figuring out where to start? Drop me a line — happy to talk through what’s worth building.