1. Why Routing Matters
The LLM landscape has exploded. OpenAI alone offers GPT-4o, GPT-4o-mini, o1, o1-mini, and GPT-3.5-turbo. Add Anthropic's Claude family, Google's Gemini lineup, Mistral, Llama, and specialized models, and you're looking at 20+ viable production models.
Each model sits at a different point on the cost-quality-speed surface. GPT-4o produces excellent results but costs $5/million input tokens and takes 2-3 seconds per request. GPT-4o-mini costs $0.15/million input tokens and responds in under a second, but it struggles with complex reasoning.
The key insight: 80% of production LLM requests don't need the most powerful model. Simple classification, extraction, formatting, and summarization tasks run perfectly on smaller, cheaper models. Only the genuinely complex tasks (multi-step reasoning, nuanced code generation, architectural decisions) benefit from premium models.
At ChaozCode, routing reduced our monthly LLM spend from $12,400 to $3,200, a 74% reduction. The average quality score (measured by our evaluation pipeline) dropped by only 2.1%, from 0.91 to 0.89. That's $9,200/month saved for a barely perceptible quality difference.
Routing isn't just about cost. It's also about latency (smaller models respond faster), availability (if one provider is down, route to another), and compliance (some data must stay on specific providers for regulatory reasons).
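The availability point deserves a concrete shape. A minimal failover sketch, assuming hypothetical provider clients that each expose an async `generate` method (the names and the `ProviderUnavailableError` type are illustrative, not a real SDK):

```python
import asyncio

class ProviderUnavailableError(Exception):
    """Raised by a (hypothetical) provider client during an outage."""

async def generate_with_failover(prompt: str, providers: list) -> str:
    """Try providers in preference order; fall through on outage or timeout."""
    last_error = None
    for provider in providers:
        try:
            # Cap per-provider wait so a slow provider doesn't stall the request
            return await asyncio.wait_for(provider.generate(prompt), timeout=10.0)
        except (ProviderUnavailableError, asyncio.TimeoutError) as exc:
            last_error = exc  # provider down or too slow; try the next one
    raise RuntimeError("All providers failed") from last_error
```

The same pattern composes with the routers below: the router picks a preference order, and failover walks it.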
2. The Cost-Performance Landscape
Understanding the model landscape is a prerequisite to building a good router. Here's how the major models compare on our internal benchmarks:
| Model | Cost, input / output (per 1M tokens) | Avg Latency | Quality Score | Best For |
|---|---|---|---|---|
| GPT-4o | $5.00 / $15.00 | 2.1s | 0.94 | Complex reasoning, code gen |
| Claude Sonnet | $3.00 / $15.00 | 1.8s | 0.93 | Long context, analysis |
| GPT-4o-mini | $0.15 / $0.60 | 0.6s | 0.82 | Classification, extraction |
| Claude Haiku | $0.25 / $1.25 | 0.5s | 0.80 | Summarization, formatting |
| Gemini Flash | $0.075 / $0.30 | 0.4s | 0.78 | Simple tasks, high volume |
The input-token cost difference between the cheapest model (Gemini Flash) and the most expensive (GPT-4o) is roughly 66x. If you can route even half your traffic to cheaper models, the savings are enormous.
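To make "half your traffic" concrete, here's the blended input-token cost for an illustrative 50/50 split between GPT-4o and Gemini Flash, using the input prices from the table above:

```python
# Input-token prices from the table above, in $ per 1M tokens.
PRICE = {"gpt-4o": 5.00, "gemini-flash": 0.075}

def blended_cost(split: dict) -> float:
    """Weighted average cost per 1M input tokens for a traffic split."""
    assert abs(sum(split.values()) - 1.0) < 1e-9  # shares must sum to 1
    return sum(PRICE[model] * share for model, share in split.items())

all_premium = blended_cost({"gpt-4o": 1.0})                        # 5.00
half_routed = blended_cost({"gpt-4o": 0.5, "gemini-flash": 0.5})   # 2.5375
savings = 1 - half_routed / all_premium                            # ~49% cheaper
```

Even this naive split nearly halves the bill; the routers below push the split much further toward the cheap end.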
A model that scores 0.78 overall might score 0.95 on classification and 0.45 on complex reasoning. Always benchmark per task type, not per model. The router's job is to match tasks to models, not to pick one "best" model.
3. ML Router Architecture
ChaozCode's ML Router is a cascade-based routing system that classifies incoming requests and routes them to the optimal model based on task complexity, cost budget, latency requirements, and historical performance.
Three-Layer Cascade
```python
class MLRouterCascade:
    """Three-layer cascade router for LLM requests"""

    def __init__(self):
        self.intent_classifier = IntentClassifier()  # Layer 1: What kind of task?
        self.complexity_scorer = ComplexityScorer()  # Layer 2: How hard is it?
        self.model_selector = ModelSelector()        # Layer 3: Which model?

    async def route(self, request: LLMRequest) -> RoutingDecision:
        # Layer 1: Classify intent
        intent = await self.intent_classifier.classify(request.prompt)
        # e.g., "code_generation", "summarization", "classification", "reasoning"

        # Layer 2: Score complexity
        complexity = await self.complexity_scorer.score(request.prompt, intent)
        # Returns 0.0 (trivial) to 1.0 (extremely complex)

        # Layer 3: Select model
        candidates = self.model_selector.get_candidates(
            intent=intent,
            complexity=complexity,
            max_cost=request.cost_budget,
            max_latency=request.latency_budget,
            required_capabilities=request.capabilities,
        )

        # Pick the cheapest candidate that meets the quality threshold
        selected = min(candidates, key=lambda m: m.cost_per_token)

        return RoutingDecision(
            model=selected,
            intent=intent,
            complexity=complexity,
            estimated_cost=self.estimate_cost(request, selected),
            confidence=selected.expected_quality,
        )
```
Intent Classification
The first layer classifies what the user is trying to do. This is fast (runs on a lightweight model or even a rule-based classifier) and determines which family of models to consider.
```python
class IntentClassifier:
    """Classify request intent using a lightweight model"""

    INTENTS = [
        "code_generation",  # Writing new code
        "code_review",      # Reviewing/analyzing code
        "summarization",    # Condensing text
        "classification",   # Categorizing input
        "extraction",       # Pulling structured data from text
        "reasoning",        # Complex multi-step thinking
        "translation",      # Language or format conversion
        "formatting",       # Restructuring without changing meaning
        "conversation",     # Chat/dialogue
    ]

    async def classify(self, prompt: str) -> str:
        # Use a fine-tuned small model for fast classification
        result = await self.classifier_model.predict(
            prompt[:500],  # Only need the first 500 chars
            labels=self.INTENTS,
        )
        return result.top_label
```
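Before you have a fine-tuned model, the rule-based version mentioned above is enough to start. A minimal keyword sketch (the patterns are illustrative, not tuned on real traffic):

```python
import re

# Illustrative keyword patterns; a production list would be tuned on real traffic.
RULES = [
    ("code_generation", re.compile(
        r"\b(write|implement|generate)\b.*\b(function|class|code|script)\b", re.I)),
    ("summarization", re.compile(r"\b(summariz|tl;?dr|condense)\w*", re.I)),
    ("translation", re.compile(r"\btranslate\b", re.I)),
    ("classification", re.compile(r"\b(classify|categorize|label)\b", re.I)),
]

def classify_by_keywords(prompt: str, default: str = "conversation") -> str:
    """Return the first matching intent, else a default."""
    head = prompt[:500]  # first 500 chars are usually enough, as above
    for intent, pattern in RULES:
        if pattern.search(head):
            return intent
    return default
```

Crude, but it routes the obvious cases and gives you labeled traffic to train the real classifier on later.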
4. Routing Algorithms
Once you know the intent and complexity, how do you pick the model? Three algorithms dominate.
Threshold-Based Routing
The simplest approach. Define complexity thresholds that map to model tiers.
```python
def threshold_route(complexity: float, intent: str) -> str:
    """Route based on complexity thresholds"""
    if intent in ("formatting", "classification", "extraction"):
        # These tasks rarely need premium models
        if complexity < 0.7:
            return "gpt-4o-mini"
        return "gpt-4o"

    if intent in ("code_generation", "reasoning"):
        # These tasks are quality-sensitive
        if complexity < 0.3:
            return "gpt-4o-mini"
        elif complexity < 0.6:
            return "claude-sonnet"
        return "gpt-4o"

    # Default tier
    if complexity < 0.4:
        return "gemini-flash"
    elif complexity < 0.7:
        return "gpt-4o-mini"
    return "claude-sonnet"
```
Cost-Optimized Routing
Select the cheapest model whose expected quality exceeds a threshold. Uses historical performance data to predict quality for each task-model pair.
```python
class CostOptimizedRouter:
    """Route to the cheapest model that meets the quality threshold"""

    def __init__(self, quality_threshold: float = 0.85,
                 fallback_model: str = "gpt-4o"):
        self.threshold = quality_threshold
        self.fallback_model = fallback_model  # used when nothing qualifies
        self.performance_db = PerformanceDatabase()

    async def route(self, intent: str, complexity: float) -> str:
        # Get historical quality for each model on this intent
        models = await self.performance_db.get_model_performance(intent)

        # Filter to models that meet the quality threshold at this complexity
        viable = [
            m for m in models
            if m.predicted_quality(complexity) >= self.threshold
        ]
        if not viable:
            return self.fallback_model  # Always have a fallback

        # Sort by cost, pick the cheapest
        viable.sort(key=lambda m: m.cost_per_token)
        return viable[0].model_id
```
Multi-Armed Bandit Routing
Treat model selection as an exploration-exploitation problem. Most of the time, route to the best-known model (exploit). Occasionally, try a different model to discover if it's improved (explore).
```python
from collections import defaultdict
from typing import List

import numpy as np

class BanditRouter:
    """Thompson Sampling for model selection"""

    def __init__(self, models: List[str]):
        self.models = models
        # Track successes and failures per intent per model.
        # No explicit exploration rate is needed: Thompson Sampling explores
        # implicitly, because uncertain models draw high samples often enough
        # to keep being tried.
        self.alpha = defaultdict(lambda: defaultdict(lambda: 1))  # successes
        self.beta = defaultdict(lambda: defaultdict(lambda: 1))   # failures

    def select_model(self, intent: str) -> str:
        # Thompson Sampling: draw from each model's Beta posterior,
        # pick the model with the highest sample
        samples = {}
        for model in self.models:
            a = self.alpha[intent][model]
            b = self.beta[intent][model]
            samples[model] = np.random.beta(a, b)
        return max(samples, key=samples.get)

    def record_outcome(self, intent: str, model: str, quality: float):
        if quality >= 0.85:  # Success threshold
            self.alpha[intent][model] += 1
        else:
            self.beta[intent][model] += 1
```
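The intuition behind those alpha/beta counters: each model's quality is modeled as a Beta(α, β) posterior whose mean α/(α+β) tracks the observed success rate. A tiny standalone illustration of the posterior concentrating on a model that succeeds about 90% of the time (the 0.9 rate and 100-trial count are made up for the demo):

```python
import random

random.seed(0)  # deterministic demo
alpha, beta = 1, 1  # uniform Beta(1, 1) prior, as in the router above

# Simulate 100 outcomes for a model with a true ~90% success rate.
for _ in range(100):
    if random.random() < 0.9:
        alpha += 1  # success: quality met the threshold
    else:
        beta += 1   # failure

posterior_mean = alpha / (alpha + beta)  # approaches the true 0.9 rate
```

As the counts grow, the posterior narrows, samples cluster around the true rate, and the bandit exploits the winner while still occasionally probing the rest.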
5. Cascade Routing: Try Cheap First
Cascade routing is the most cost-effective strategy when you can tolerate slightly higher latency. The idea: try the cheapest model first. If its output doesn't meet quality standards, escalate to the next tier.
```python
class CascadeRouter:
    """Try cheap models first, escalate if quality is insufficient"""

    def __init__(self):
        # ModelTier is assumed to bind a model client as tier.model
        self.tiers = [
            ModelTier("gemini-flash", cost=0.075, quality_check=self.basic_check),
            ModelTier("gpt-4o-mini", cost=0.15, quality_check=self.standard_check),
            ModelTier("claude-sonnet", cost=3.00, quality_check=None),  # Final tier
        ]

    async def generate(self, prompt: str) -> CascadeResult:
        for tier in self.tiers:
            response = await tier.model.generate(prompt)

            # Last tier: accept regardless
            if tier.quality_check is None:
                return CascadeResult(response=response, model=tier.name, tier=tier)

            # Check quality
            if await tier.quality_check(prompt, response):
                return CascadeResult(response=response, model=tier.name, tier=tier)

            # Quality insufficient, escalate to the next tier

        raise RouterExhaustedError("All tiers failed quality checks")

    async def basic_check(self, prompt: str, response: str) -> bool:
        """Fast heuristic quality check"""
        if len(response.strip()) < 20:
            return False
        if "I don't know" in response or "I cannot" in response:
            return False
        return True

    async def standard_check(self, prompt: str, response: str) -> bool:
        """LLM-based quality verification (uses a cheap model)"""
        verdict = await self.verifier.check(prompt, response)
        return verdict.score >= 0.8
```
In production, our cascade router resolves 67% of requests at Tier 1 (cheapest), 24% at Tier 2, and only 9% require Tier 3 (premium). The average cost per request is $0.0004 compared to $0.006 if we sent everything to the premium model, a 15x reduction.
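Those resolution rates let you estimate the expected per-request cost of any cascade. An escalated request pays for every tier it touched, so (using the production rates above and hypothetical per-tier request prices, not our actual ones):

```python
# Tier resolution rates from production, and made-up per-request prices per tier.
rates = [0.67, 0.24, 0.09]           # fraction resolved at tiers 1, 2, 3
tier_cost = [0.0001, 0.0008, 0.005]  # hypothetical $ per request at each tier

# A request resolved at tier i paid for tiers 1..i (escalation is cumulative).
expected = sum(
    p * sum(tier_cost[: i + 1])
    for i, p in enumerate(rates)
)
# expected ≈ $0.000814 per request with these illustrative prices
```

The cumulative term is the catch: escalated requests pay for the failed cheap attempts too, which is why cascades only win when most traffic resolves early.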
6. A/B Testing Models in Production
New models launch constantly. How do you evaluate whether to adopt a new model without degrading production quality?
Shadow Testing
Route production traffic to the current model as normal. Simultaneously send a copy of each request to the candidate model. Compare results offline without any user impact.
class ShadowTester:
"""Test new models against production traffic"""
async def route_with_shadow(self, request: LLMRequest) -> LLMResponse:
# Primary: route normally
primary_response = await self.primary_model.generate(request)
# Shadow: test candidate (fire and forget, don't block)
asyncio.create_task(self.shadow_test(request, primary_response))
return primary_response
async def shadow_test(self, request: LLMRequest, baseline: LLMResponse):
candidate_response = await self.candidate_model.generate(request)
# Compare quality
evaluation = await self.evaluator.compare(
request=request,
baseline=baseline,
candidate=candidate_response
)
# Log results for analysis
await self.metrics.record_shadow_test(
request_id=request.id,
baseline_model=self.primary_model.name,
candidate_model=self.candidate_model.name,
quality_delta=evaluation.quality_delta,
cost_delta=evaluation.cost_delta,
latency_delta=evaluation.latency_delta
)
Gradual Rollout
After shadow testing looks promising, gradually increase the candidate's traffic share: 1% → 5% → 25% → 50% → 100%. Monitor quality metrics at each stage and roll back automatically if quality drops below threshold.
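A minimal sketch of that ramp logic. The stage percentages come from the text; the quality floor value and the shape of the quality signal are assumptions:

```python
STAGES = [0.01, 0.05, 0.25, 0.50, 1.00]  # candidate traffic share per stage

class GradualRollout:
    """Advance candidate traffic stage by stage; roll back on a quality drop."""

    def __init__(self, quality_floor: float = 0.88):  # floor is illustrative
        self.stage = 0
        self.quality_floor = quality_floor
        self.rolled_back = False

    @property
    def candidate_share(self) -> float:
        return 0.0 if self.rolled_back else STAGES[self.stage]

    def observe(self, quality: float) -> None:
        """Feed the stage's measured quality score; advance or roll back."""
        if quality < self.quality_floor:
            self.rolled_back = True  # automatic rollback to 0% candidate traffic
        elif self.stage < len(STAGES) - 1:
            self.stage += 1          # promote to the next stage
```

In practice you would gate each promotion on a minimum sample size per stage, not a single score, so that a noisy early reading doesn't promote or kill a candidate prematurely.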
7. Building Your Own Router
Here's a practical checklist for building an LLM router:
- Start with threshold routing. Define 3 tiers (cheap, standard, premium) with simple complexity thresholds. Ship it. You'll get 50-60% cost savings immediately.
- Add intent classification. A simple keyword-based classifier is enough to start. Upgrade to a fine-tuned model when you have enough data.
- Instrument everything. Log every routing decision, model response, latency, cost, and quality score. You can't optimize what you don't measure.
- Build a feedback loop. When users flag bad responses, record which model produced them and for what task type. Use this data to adjust routing thresholds.
- Graduate to cascade routing when you need maximum cost efficiency and can tolerate slightly higher p95 latency from multi-tier evaluation.
- Add bandit exploration when new models launch frequently and you want to automatically discover whether they're better for specific task types.
The best router is the one you actually deploy. Start simple with threshold routing, then add sophistication based on real production data. Premature optimization of routing is just as wasteful as premature optimization of code.
Get Intelligent Routing Out of the Box
ChaozCode's ML Router includes cascade routing, intent classification, cost-optimized selection, and feedback-driven learning. Route 233 agents across multiple models with zero configuration.
Start Building →