How Sakana trained a 7B model to orchestrate GPT, Claude and Gemini LLMs
Sakana AI released a compact 7 billion parameter model that acts as a router, deciding which of the world's largest language models (GPT-4, Claude, Gemini) should handle any given prompt. The model, called Mixture of Agents (MoA), doesn't try to compete with these giants. Instead, it coordinates them, extracting value from their collective strengths while cutting inference cost and latency for end users. This is not a new architecture in the abstract sense. It is a direct answer to a practical problem: how do you get enterprise-grade performance without paying for enterprise-grade inference on every query?
The core insight is simple but unintuitive. A smaller model can learn the decision rules for when GPT-4 excels at reasoning, when Claude handles nuance better, when Gemini's training data wins. Sakana's researchers trained the 7B model on logs of how these three models performed on thousands of tasks. The model learned patterns: if the prompt contains mathematical derivation, route to GPT-4. If it requires ethical judgment and tone-matching, route to Claude. If it's a knowledge-retrieval task with a specific date cutoff in the query, route to Gemini. The 7B model becomes a learned load balancer. It predicts which expert will perform best before making any API calls.
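To make the routing idea concrete, here is a minimal sketch of the decision interface. It is not Sakana's model: hand-written keyword rules stand in for the learned 7B router, and the backend labels, the keyword lists, and the default fallback to GPT-4 are all assumptions chosen for illustration.

```python
from dataclasses import dataclass

# Hypothetical backend labels for illustration, not real API model identifiers.
BACKENDS = ("gpt-4", "claude", "gemini")

# Assumed keyword lists. A learned router would replace these with a model prediction.
MATH_HINTS = ("derive", "derivation", "proof", "integral", "equation")
TONE_HINTS = ("tone", "ethical", "apolog", "sensitive")
LOOKUP_HINTS = ("as of", "when did", "who was", "what year")


@dataclass
class RoutingDecision:
    backend: str
    reason: str


def route(prompt: str) -> RoutingDecision:
    """Pick a backend before any API call is made.

    Crude substring matching is used only to show the shape of the decision.
    """
    text = prompt.lower()
    if any(hint in text for hint in MATH_HINTS):
        return RoutingDecision("gpt-4", "mathematical derivation")
    if any(hint in text for hint in TONE_HINTS):
        return RoutingDecision("claude", "ethical judgment or tone-matching")
    if any(hint in text for hint in LOOKUP_HINTS):
        return RoutingDecision("gemini", "knowledge retrieval")
    # Assumption: unmatched prompts fall back to the most capable backend.
    return RoutingDecision("gpt-4", "default fallback")


if __name__ == "__main__":
    for example in ("Derive the closed form of this integral",
                    "Rewrite this apology in a warmer tone",
                    "Who was the CEO as of 2021?"):
        print(example, "->", route(example).backend)
```

The point of the sketch is the ordering: the decision is made from the prompt alone, and only one external call follows it.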
This changes the economics of inference. Running a 7B model locally or on-premise costs roughly 50-100x less than querying GPT-4 multiple times. Sakana's router adds one forward pass before any external call. If you're protecting against unnecessary API costs in a high-volume system, the savings compound. A customer service platform handling 10,000 queries per day no longer needs to route everything to GPT-4 just to guarantee quality. The router catches 60-70% of queries that Claude can handle adequately, saving both money and latency. The remaining 30-40% that genuinely need GPT-4's reasoning capacity still get it, but with purpose.
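The arithmetic behind that claim can be sketched in a few lines. Only the 10,000 queries per day and the 60-70% share come from the figures above. The per-query prices are placeholders invented for illustration, not published rates, and the router price simply assumes the 100x end of the 50-100x range.

```python
def daily_cost(queries: int, cheap_share: float, frontier_price: float,
               cheap_price: float, router_price: float) -> float:
    """Daily spend when a router sends `cheap_share` of queries to the cheaper model.

    Every query pays for one router forward pass, then exactly one external call.
    """
    to_cheap = queries * cheap_share
    to_frontier = queries - to_cheap
    return (queries * router_price
            + to_cheap * cheap_price
            + to_frontier * frontier_price)


# Placeholder per-query prices in dollars (assumptions, not real pricing).
FRONTIER_PRICE = 0.03
CHEAP_PRICE = 0.01
ROUTER_PRICE = FRONTIER_PRICE / 100  # assumes the router is 100x cheaper

QUERIES_PER_DAY = 10_000

# Baseline: everything goes to the frontier model and no router is run.
baseline = daily_cost(QUERIES_PER_DAY, 0.0, FRONTIER_PRICE, CHEAP_PRICE, 0.0)
# Routed: 65% of queries (midpoint of 60-70%) go to the cheaper model.
routed = daily_cost(QUERIES_PER_DAY, 0.65, FRONTIER_PRICE, CHEAP_PRICE, ROUTER_PRICE)

if __name__ == "__main__":
    print(f"baseline: ${baseline:.2f}/day, routed: ${routed:.2f}/day")
```

Under these placeholder prices the router's own cost is a small term next to the external calls, which is why the saving is driven almost entirely by the share of queries that can be diverted.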
The training methodology matters here because it reveals the limits of what even a small model can learn. Sakana didn't train on task descriptions alone. They trained on the actual outputs of each model, measuring performance using domain-specific metrics. A math problem gets evaluated on correctness. A creative writing prompt gets human preference ratings. A coding task compiles and runs or it doesn't. The router learned not just to classify tasks abstractly but to predict which model's approach (its tendency toward brevity, its hallucination profile, its reasoning depth) matched the performance signature of each task type. This is transfer learning applied to model selection. The 7B model becomes a proxy for human judgment about which expert to consult.
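One plausible way to turn such performance logs into supervision for a router is to label each prompt with the backend that scored highest on it. The sketch below assumes a log format (a prompt plus one score per backend, already normalized to the 0-1 range by the task's own metric). The field names and sample records are invented for illustration and do not come from Sakana's data.

```python
from typing import Dict, List, Tuple

# Assumed log record shape: {"prompt": str, "scores": {backend_name: float}}
LogRecord = Dict[str, object]


def best_backend(scores: Dict[str, float]) -> str:
    """Return the backend with the highest score.

    Ties go to whichever backend appears first in the dict.
    """
    return max(scores, key=lambda name: scores[name])


def build_training_pairs(logs: List[LogRecord]) -> List[Tuple[str, str]]:
    """Convert performance logs into (prompt, target_backend) pairs for a router."""
    pairs = []
    for record in logs:
        prompt = str(record["prompt"])
        scores = dict(record["scores"])  # type: ignore[arg-type]
        pairs.append((prompt, best_backend(scores)))
    return pairs


# Invented sample records: correctness for math, preference rating for writing.
SAMPLE_LOGS: List[LogRecord] = [
    {"prompt": "Solve for x: 2x + 3 = 11",
     "scores": {"gpt-4": 1.0, "claude": 0.0, "gemini": 0.0}},
    {"prompt": "Write a gentle rejection email",
     "scores": {"gpt-4": 0.6, "claude": 0.8, "gemini": 0.5}},
]

if __name__ == "__main__":
    for prompt, target in build_training_pairs(SAMPLE_LOGS):
        print(f"{target:7s} <- {prompt}")
```

A hard argmax label discards how close the runners-up were. Training on the full score vector instead would let a router learn when a cheaper backend is nearly as good, though whether Sakana did this is not stated.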
What Sakana avoided is equally telling. They didn't try to improve GPT-4, Claude, or Gemini. They didn't build synthetic data. They didn't claim the router would replace these models or create some new emergent capability. They built a thinner layer of automation on top of existing infrastructure. This matters because it's defensible and deployable today. A company with access to multiple API keys and a cost constraint can integrate this immediately. There's no waiting for new model releases. There's no bet on speculative architecture. The router works because the underlying models already work.
The liability is that performance depends entirely on the quality and diversity of the training data. If the logs used to train the router came from a specific domain, say customer support tickets, the router might route poorly on a completely different task type like scientific paper summarization. The router could overfit to statistical patterns that don't generalize. A model trained to prefer GPT-4 on reasoning tasks might route ambiguous prompts to GPT-4 out of overcaution, defeating cost optimization. Sakana would need to retrain or fine-tune the router for different domains, which is a maintenance burden that APIs handle automatically.
This work also exposes a fragile moment in AI infrastructure. As long as GPT-4, Claude, and Gemini remain closed-source and require API access, models like Sakana's router add real value. But if any one of these becomes cheaply available as a weight download (if OpenAI releases a locally-runnable version of GPT-4, or Anthropic does the same with Claude), the entire premise collapses. The router's advantage evaporates when you can run your preferred model on commodity hardware. Sakana is essentially arbitraging the gap between API costs and the actual computational footprint of inference. That gap will shrink.
For now, the work represents a clean statement about the state of LLM deployment: the frontier models are good, but they are expensive and slow. A 7 billion parameter model is good enough to know when you need the frontier, and that knowledge, applied consistently at scale, is worth a lot. Sakana has built a tax on the use of these APIs, but a tax that saves more than it costs. That's the rare kind of infrastructure that will see adoption.