1. Why Routing Matters
The LLM landscape has exploded. OpenAI alone offers GPT-4o, GPT-4o-mini, o1, o1-mini, and GPT-3.5-turbo. Add Anthropic's Claude family, Google's Gemini lineup, Mistral, Llama, and specialized models, and you're looking at 20+ viable production models.
Each model sits at a different point on the cost-quality-speed surface. GPT-4o produces excellent results but costs $5/million input tokens and takes 2-3 seconds per request. GPT-4o-mini costs $0.15/million input tokens and responds in under a second, but it struggles with complex reasoning.
The key insight: 80% of production LLM requests don't need the most powerful model. Simple classification, extraction, formatting, and summarization tasks run perfectly on smaller, cheaper models. Only the genuinely complex tasks (multi-step reasoning, nuanced code generation, architectural decisions) benefit from premium models.
At ChaozCode, routing reduced our monthly LLM spend from $12,400 to $3,200, a 74% reduction. The average quality score (measured by our evaluation pipeline) dropped by only 2.1%, from 0.91 to 0.89. That's $9,200/month saved for a barely perceptible quality difference.
Routing isn't just about cost. It's also about latency (smaller models respond faster), availability (if one provider is down, route to another), and compliance (some data must stay on specific providers for regulatory reasons).
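The availability point deserves a concrete shape. A minimal failover sketch, assuming hypothetical provider clients that each expose an async `generate` method (the names and the `ProviderUnavailableError` type are illustrative, not a real SDK):

```python
import asyncio

class ProviderUnavailableError(Exception):
    """Raised by a (hypothetical) provider client during an outage."""

async def generate_with_failover(prompt: str, providers: list) -> str:
    """Try providers in preference order; fall through on outage or timeout."""
    last_error = None
    for provider in providers:
        try:
            # Cap per-provider wait so a slow provider doesn't stall the request
            return await asyncio.wait_for(provider.generate(prompt), timeout=10.0)
        except (ProviderUnavailableError, asyncio.TimeoutError) as exc:
            last_error = exc  # provider down or too slow; try the next one
    raise RuntimeError("All providers failed") from last_error
```

The same pattern composes with the routers below: the router picks a preference order, and failover walks it.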
2. The Cost-Performance Landscape
Understanding the model landscape is a prerequisite to building a good router. Here's how the major models compare on our internal benchmarks:
| Model | Cost, input / output (per 1M tokens) | Avg Latency | Quality Score | Best For |
|---|---|---|---|---|
| GPT-4o | $5.00 / $15.00 | 2.1s | 0.94 | Complex reasoning, code gen |
| Claude Sonnet | $3.00 / $15.00 | 1.8s | 0.93 | Long context, analysis |
| GPT-4o-mini | $0.15 / $0.60 | 0.6s | 0.82 | Classification, extraction |
| Claude Haiku | $0.25 / $1.25 | 0.5s | 0.80 | Summarization, formatting |
| Gemini Flash | $0.075 / $0.30 | 0.4s | 0.78 | Simple tasks, high volume |
The input-token cost difference between the cheapest model (Gemini Flash) and the most expensive (GPT-4o) is roughly 66x. If you can route even half your traffic to cheaper models, the savings are enormous.
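To make "half your traffic" concrete, here's the blended input-token cost for an illustrative 50/50 split between GPT-4o and Gemini Flash, using the input prices from the table above:

```python
# Input-token prices from the table above, in $ per 1M tokens.
PRICE = {"gpt-4o": 5.00, "gemini-flash": 0.075}

def blended_cost(split: dict) -> float:
    """Weighted average cost per 1M input tokens for a traffic split."""
    assert abs(sum(split.values()) - 1.0) < 1e-9  # shares must sum to 1
    return sum(PRICE[model] * share for model, share in split.items())

all_premium = blended_cost({"gpt-4o": 1.0})                        # 5.00
half_routed = blended_cost({"gpt-4o": 0.5, "gemini-flash": 0.5})   # 2.5375
savings = 1 - half_routed / all_premium                            # ~49% cheaper
```

Even this naive split nearly halves the bill; the routers below push the split much further toward the cheap end.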
A model that scores 0.78 overall might score 0.95 on classification and 0.45 on complex reasoning. Always benchmark per task type, not per model. The router's job is to match tasks to models, not to pick one "best" model.
3. ML Router Architecture
ChaozCode's ML Router is a cascade-based routing system that classifies incoming requests and routes them to the optimal model based on task complexity, cost budget, latency requirements, and historical performance.
Three-Layer Cascade
```python
class MLRouterCascade:
    """Three-layer cascade router for LLM requests"""

    def __init__(self):
        self.intent_classifier = IntentClassifier()  # Layer 1: What kind of task?
        self.complexity_scorer = ComplexityScorer()  # Layer 2: How hard is it?
        self.model_selector = ModelSelector()        # Layer 3: Which model?

    async def route(self, request: LLMRequest) -> RoutingDecision:
        # Layer 1: Classify intent
        intent = await self.intent_classifier.classify(request.prompt)
        # e.g., "code_generation", "summarization", "classification", "reasoning"

        # Layer 2: Score complexity
        complexity = await self.complexity_scorer.score(request.prompt, intent)
        # Returns 0.0 (trivial) to 1.0 (extremely complex)

        # Layer 3: Select model
        candidates = self.model_selector.get_candidates(
            intent=intent,
            complexity=complexity,
            max_cost=request.cost_budget,
            max_latency=request.latency_budget,
            required_capabilities=request.capabilities,
        )

        # Pick the cheapest candidate that meets the quality threshold
        selected = min(candidates, key=lambda m: m.cost_per_token)

        return RoutingDecision(
            model=selected,
            intent=intent,
            complexity=complexity,
            estimated_cost=self.estimate_cost(request, selected),
            confidence=selected.expected_quality,
        )
```
Intent Classification
The first layer classifies what the user is trying to do. This is fast (runs on a lightweight model or even a rule-based classifier) and determines which family of models to consider.
```python
class IntentClassifier:
    """Classify request intent using a lightweight model"""

    INTENTS = [
        "code_generation",  # Writing new code
        "code_review",      # Reviewing/analyzing code
        "summarization",    # Condensing text
        "classification",   # Categorizing input
        "extraction",       # Pulling structured data from text
        "reasoning",        # Complex multi-step thinking
        "translation",      # Language or format conversion
        "formatting",       # Restructuring without changing meaning
        "conversation",     # Chat/dialogue
    ]

    async def classify(self, prompt: str) -> str:
        # Use a fine-tuned small model for fast classification
        result = await self.classifier_model.predict(
            prompt[:500],  # Only need the first 500 chars
            labels=self.INTENTS,
        )
        return result.top_label
```
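Before you have a fine-tuned model, the rule-based version mentioned above is enough to start. A minimal keyword sketch (the patterns are illustrative, not tuned on real traffic):

```python
import re

# Illustrative keyword patterns; a production list would be tuned on real traffic.
RULES = [
    ("code_generation", re.compile(
        r"\b(write|implement|generate)\b.*\b(function|class|code|script)\b", re.I)),
    ("summarization", re.compile(r"\b(summariz|tl;?dr|condense)\w*", re.I)),
    ("translation", re.compile(r"\btranslate\b", re.I)),
    ("classification", re.compile(r"\b(classify|categorize|label)\b", re.I)),
]

def classify_by_keywords(prompt: str, default: str = "conversation") -> str:
    """Return the first matching intent, else a default."""
    head = prompt[:500]  # first 500 chars are usually enough, as above
    for intent, pattern in RULES:
        if pattern.search(head):
            return intent
    return default
```

Crude, but it routes the obvious cases and gives you labeled traffic to train the real classifier on later.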
4. Routing Algorithms
Once you know the intent and complexity, how do you pick the model? Three algorithms dominate.
Threshold-Based Routing
The simplest approach. Define complexity thresholds that map to model tiers.
```python
def threshold_route(complexity: float, intent: str) -> str:
    """Route based on complexity thresholds"""
    if intent in ("formatting", "classification", "extraction"):
        # These tasks rarely need premium models
        if complexity < 0.7:
            return "gpt-4o-mini"
        return "gpt-4o"

    if intent in ("code_generation", "reasoning"):
        # These tasks are quality-sensitive
        if complexity < 0.3:
            return "gpt-4o-mini"
        elif complexity < 0.6:
            return "claude-sonnet"
        return "gpt-4o"

    # Default tier
    if complexity < 0.4:
        return "gemini-flash"
    elif complexity < 0.7:
        return "gpt-4o-mini"
    return "claude-sonnet"
```
Cost-Optimized Routing
Select the cheapest model whose expected quality exceeds a threshold. Uses historical performance data to predict quality for each task-model pair.
```python
class CostOptimizedRouter:
    """Route to the cheapest model that meets the quality threshold"""

    def __init__(self, quality_threshold: float = 0.85,
                 fallback_model: str = "gpt-4o"):
        self.threshold = quality_threshold
        self.fallback_model = fallback_model  # used when nothing qualifies
        self.performance_db = PerformanceDatabase()

    async def route(self, intent: str, complexity: float) -> str:
        # Get historical quality for each model on this intent
        models = await self.performance_db.get_model_performance(intent)

        # Filter to models that meet the quality threshold at this complexity
        viable = [
            m for m in models
            if m.predicted_quality(complexity) >= self.threshold
        ]
        if not viable:
            return self.fallback_model  # Always have a fallback

        # Sort by cost, pick the cheapest
        viable.sort(key=lambda m: m.cost_per_token)
        return viable[0].model_id
```
Multi-Armed Bandit Routing
Treat model selection as an exploration-exploitation problem. Most of the time, route to the best-known model (exploit). Occasionally, try a different model to discover if it's improved (explore).
```python
from collections import defaultdict
from typing import List

import numpy as np

class BanditRouter:
    """Thompson Sampling for model selection"""

    def __init__(self, models: List[str]):
        self.models = models
        # Track successes and failures per intent per model.
        # No explicit exploration rate is needed: Thompson Sampling explores
        # implicitly, because uncertain models draw high samples often enough
        # to keep being tried.
        self.alpha = defaultdict(lambda: defaultdict(lambda: 1))  # successes
        self.beta = defaultdict(lambda: defaultdict(lambda: 1))   # failures

    def select_model(self, intent: str) -> str:
        # Thompson Sampling: draw from each model's Beta posterior,
        # pick the model with the highest sample
        samples = {}
        for model in self.models:
            a = self.alpha[intent][model]
            b = self.beta[intent][model]
            samples[model] = np.random.beta(a, b)
        return max(samples, key=samples.get)

    def record_outcome(self, intent: str, model: str, quality: float):
        if quality >= 0.85:  # Success threshold
            self.alpha[intent][model] += 1
        else:
            self.beta[intent][model] += 1
```
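The intuition behind those alpha/beta counters: each model's quality is modeled as a Beta(α, β) posterior whose mean α/(α+β) tracks the observed success rate. A tiny standalone illustration of the posterior concentrating on a model that succeeds about 90% of the time (the 0.9 rate and 100-trial count are made up for the demo):

```python
import random

random.seed(0)  # deterministic demo
alpha, beta = 1, 1  # uniform Beta(1, 1) prior, as in the router above

# Simulate 100 outcomes for a model with a true ~90% success rate.
for _ in range(100):
    if random.random() < 0.9:
        alpha += 1  # success: quality met the threshold
    else:
        beta += 1   # failure

posterior_mean = alpha / (alpha + beta)  # approaches the true 0.9 rate
```

As the counts grow, the posterior narrows, samples cluster around the true rate, and the bandit exploits the winner while still occasionally probing the rest.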
5. Cascade Routing: Try Cheap First
Cascade routing is the most cost-effective strategy when you can tolerate slightly higher latency. The idea: try the cheapest model first. If its output doesn't meet quality standards, escalate to the next tier.
```python
class CascadeRouter:
    """Try cheap models first, escalate if quality is insufficient"""

    def __init__(self):
        # ModelTier is assumed to bind a model client as tier.model
        self.tiers = [
            ModelTier("gemini-flash", cost=0.075, quality_check=self.basic_check),
            ModelTier("gpt-4o-mini", cost=0.15, quality_check=self.standard_check),
            ModelTier("claude-sonnet", cost=3.00, quality_check=None),  # Final tier
        ]

    async def generate(self, prompt: str) -> CascadeResult:
        for tier in self.tiers:
            response = await tier.model.generate(prompt)

            # Last tier: accept regardless
            if tier.quality_check is None:
                return CascadeResult(response=response, model=tier.name, tier=tier)

            # Check quality
            if await tier.quality_check(prompt, response):
                return CascadeResult(response=response, model=tier.name, tier=tier)

            # Quality insufficient, escalate to the next tier

        raise RouterExhaustedError("All tiers failed quality checks")

    async def basic_check(self, prompt: str, response: str) -> bool:
        """Fast heuristic quality check"""
        if len(response.strip()) < 20:
            return False
        if "I don't know" in response or "I cannot" in response:
            return False
        return True

    async def standard_check(self, prompt: str, response: str) -> bool:
        """LLM-based quality verification (uses a cheap model)"""
        verdict = await self.verifier.check(prompt, response)
        return verdict.score >= 0.8
```
In production, our cascade router resolves 67% of requests at Tier 1 (cheapest), 24% at Tier 2, and only 9% require Tier 3 (premium). The average cost per request is $0.0004 compared to $0.006 if we sent everything to the premium model, a 15x reduction.
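Those resolution rates let you estimate the expected per-request cost of any cascade. An escalated request pays for every tier it touched, so (using the production rates above and hypothetical per-tier request prices, not our actual ones):

```python
# Tier resolution rates from production, and made-up per-request prices per tier.
rates = [0.67, 0.24, 0.09]           # fraction resolved at tiers 1, 2, 3
tier_cost = [0.0001, 0.0008, 0.005]  # hypothetical $ per request at each tier

# A request resolved at tier i paid for tiers 1..i (escalation is cumulative).
expected = sum(
    p * sum(tier_cost[: i + 1])
    for i, p in enumerate(rates)
)
# expected ≈ $0.000814 per request with these illustrative prices
```

The cumulative term is the catch: escalated requests pay for the failed cheap attempts too, which is why cascades only win when most traffic resolves early.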
6. A/B Testing Models in Production
New models launch constantly. How do you evaluate whether to adopt a new model without degrading production quality?
Shadow Testing
Route production traffic to the current model as normal. Simultaneously send a copy of each request to the candidate model. Compare results offline without any user impact.
class ShadowTester:
"""Test new models against production traffic"""
async def route_with_shadow(self, request: LLMRequest) -> LLMResponse:
# Primary: route normally
primary_response = await self.primary_model.generate(request)
# Shadow: test candidate (fire and forget, don't block)
asyncio.create_task(self.shadow_test(request, primary_response))
return primary_response
async def shadow_test(self, request: LLMRequest, baseline: LLMResponse):
candidate_response = await self.candidate_model.generate(request)
# Compare quality
evaluation = await self.evaluator.compare(
request=request,
baseline=baseline,
candidate=candidate_response
)
# Log results for analysis
await self.metrics.record_shadow_test(
request_id=request.id,
baseline_model=self.primary_model.name,
candidate_model=self.candidate_model.name,
quality_delta=evaluation.quality_delta,
cost_delta=evaluation.cost_delta,
latency_delta=evaluation.latency_delta
)
Gradual Rollout
After shadow testing looks promising, gradually increase the candidate's traffic share: 1% → 5% → 25% → 50% → 100%. Monitor quality metrics at each stage and roll back automatically if quality drops below threshold.
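A minimal sketch of that ramp logic. The stage percentages come from the text; the quality floor value and the shape of the quality signal are assumptions:

```python
STAGES = [0.01, 0.05, 0.25, 0.50, 1.00]  # candidate traffic share per stage

class GradualRollout:
    """Advance candidate traffic stage by stage; roll back on a quality drop."""

    def __init__(self, quality_floor: float = 0.88):  # floor is illustrative
        self.stage = 0
        self.quality_floor = quality_floor
        self.rolled_back = False

    @property
    def candidate_share(self) -> float:
        return 0.0 if self.rolled_back else STAGES[self.stage]

    def observe(self, quality: float) -> None:
        """Feed the stage's measured quality score; advance or roll back."""
        if quality < self.quality_floor:
            self.rolled_back = True  # automatic rollback to 0% candidate traffic
        elif self.stage < len(STAGES) - 1:
            self.stage += 1          # promote to the next stage
```

In practice you would gate each promotion on a minimum sample size per stage, not a single score, so that a noisy early reading doesn't promote or kill a candidate prematurely.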
7. Building Your Own Router
Here's a practical checklist for building an LLM router:
- Start with threshold routing. Define 3 tiers (cheap, standard, premium) with simple complexity thresholds. Ship it. You'll get 50-60% cost savings immediately.
- Add intent classification. A simple keyword-based classifier is enough to start. Upgrade to a fine-tuned model when you have enough data.
- Instrument everything. Log every routing decision, model response, latency, cost, and quality score. You can't optimize what you don't measure.
- Build a feedback loop. When users flag bad responses, record which model produced them and for what task type. Use this data to adjust routing thresholds.
- Graduate to cascade routing when you need maximum cost efficiency and can tolerate slightly higher p95 latency from multi-tier evaluation.
- Add bandit exploration when new models launch frequently and you want to automatically discover whether they're better for specific task types.
The best router is the one you actually deploy. Start simple with threshold routing, then add sophistication based on real production data. Premature optimization of routing is just as wasteful as premature optimization of code.
Get Intelligent Routing Out of the Box
ChaozCode's ML Router includes cascade routing, intent classification, cost-optimized selection, and feedback-driven learning. Route 233 agents across multiple models with zero configuration.
Start Building →