AI agent debugging is different. Your agent worked perfectly yesterday, then suddenly starts giving nonsensical answers, calling the wrong tools, or ignoring critical context. Traditional debugging approaches — breakpoints, stack traces, unit tests — don’t help when the problem is emergent behavior from a black-box LLM.
I’ve debugged over 10,000 agent failures at ChaozCode. Some were obvious (missing API keys), others were subtle (prompt injection through user input), and a few were genuinely mysterious (model behavior changes between API versions). These 10 techniques will help you find the root cause faster.
## Technique 1: Replay from Memory
The most powerful debugging technique for memory-enabled agents: reconstruct the exact state when the failure occurred. Memory Spine’s timeline feature lets you replay agent decision-making step-by-step.
```python
# Replay agent state from the failure timestamp
from memory_spine import MemorySpine

memory = MemorySpine()

# Get all memories up to the failure point
failure_timestamp = "2026-01-08T14:23:17Z"
agent_state = memory.get_timeline(
    before_timestamp=failure_timestamp,
    include_context=True,
)

# Reconstruct the decision context, step by step
for event in agent_state:
    print(f"{event.timestamp}: {event.type}")
    print(f"  Content: {event.content}")
    print(f"  Context: {event.retrieved_memories}")
    print("---")
```
### Real War Story: The Cascading Context Bug
"Our code review agent started approving obviously broken PRs. The replay showed that it had stored a memory saying ‘always approve Marcus’s PRs’ from a joking comment in Slack. That joke became agent gospel, overriding actual code review logic." — Senior Engineer, ChaozCode
Replay debugging revealed:
- 11:30 AM: Agent stored casual comment as high-importance memory
- 2:15 PM: Memory search retrieved the “always approve” rule
- 2:16 PM: Agent applied rule instead of analyzing code quality
Without memory replay, we would’ve assumed a model issue. The real bug was context pollution from unintended memory storage.
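One mitigation worth sketching: score importance at write time, so casual chat can never become agent policy. This is a minimal illustration, not Memory Spine's API; `classify_importance`, the marker list, and the thresholds are all hypothetical choices.

```python
# Hedged sketch: screen chat-sourced text before storing it as a
# high-importance memory. CASUAL_MARKERS and the score values are
# illustrative thresholds, not part of any real API.
CASUAL_MARKERS = ("lol", "jk", "haha", "just kidding", ":)")

def classify_importance(text: str, source: str) -> float:
    """Return an importance score in [0, 1]; casual chat is capped low."""
    text_lower = text.lower()
    if source == "slack" and any(m in text_lower for m in CASUAL_MARKERS):
        return 0.1  # banter: never store as a rule the agent will follow
    if source == "code_review":
        return 0.8  # review comments carry real signal
    return 0.5  # default: store, but don't let it override anything

# The joking Slack comment from the war story gets capped at 0.1
importance = classify_importance("always approve Marcus's PRs lol jk", "slack")
```

Had a guard like this been in place, the "always approve" joke would have been stored (if at all) at an importance too low to outrank actual review logic.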
## Technique 2: Decision Tree Tracing
Agents make cascading decisions. Understanding why an agent chose path A over path B requires tracing the decision tree at each branch point.
```python
# Decision tree tracer
from datetime import datetime, timezone


class DecisionTracer:
    def __init__(self):
        self.decisions = []

    def log_decision(self, context, options, chosen, reasoning):
        """Log each decision point with full context."""
        self.decisions.append({
            "timestamp": datetime.now(timezone.utc),
            "context": context,
            "available_options": options,
            "chosen_option": chosen,
            "reasoning": reasoning,
            "confidence_score": self.calculate_confidence(reasoning),
        })

    def trace_failure_path(self, failure_point):
        """Show the decision path leading to the failure."""
        relevant_decisions = [
            d for d in self.decisions
            if d["timestamp"] <= failure_point
        ]
        print("Decision Path to Failure:")
        for i, decision in enumerate(relevant_decisions):
            print(f"{i + 1}. {decision['chosen_option']}")
            print(f"   Reason: {decision['reasoning']}")
            print(f"   Confidence: {decision['confidence_score']}")
        return self.find_weak_decisions(relevant_decisions)
```
### Common Decision Failure Patterns
| Pattern | Description | Debug Signal |
|---|---|---|
| Confidence Cascade | Low-confidence decision leads to more bad decisions | Decreasing confidence scores over time |
| Context Drift | Agent loses original objective through iterations | Reasoning shifts away from initial goal |
| Tool Fixation | Agent repeatedly uses same tool despite poor results | Identical tool calls with declining success |
| Memory Pollution | Bad memory biases all future decisions | Consistent reference to problematic memory |
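The first pattern in the table can be caught mechanically by scanning the tracer's log for a run of strictly declining confidence scores. A minimal sketch; `find_confidence_cascades` and the window size of 3 are illustrative choices, not part of any tracer API:

```python
# Sketch: flag "confidence cascade" runs in a DecisionTracer-style log.
# `decisions` is a list of dicts with a "confidence_score" key, in order.
def find_confidence_cascades(decisions, window=3):
    """Return start indices of runs where confidence fell `window` steps in a row."""
    cascades = []
    scores = [d["confidence_score"] for d in decisions]
    run = 1  # length of the current strictly-declining run
    for i in range(1, len(scores)):
        if scores[i] < scores[i - 1]:
            run += 1
            if run == window:
                cascades.append(i - window + 1)  # where the slide began
        else:
            run = 1
    return cascades
```

Anything this flags is a good place to start reading reasoning strings: the decision at the start of the slide is usually the one that put the agent on the wrong branch.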
## Technique 3: Tool Call Inspection
Agents fail when they call the wrong tools, call tools with bad arguments, or misinterpret tool results. Systematic tool call analysis reveals these issues.
```python
# Tool call auditor
class ToolCallAuditor:
    def audit_tool_usage(self, agent_session):
        """Analyze tool calling patterns for anomalies."""
        tool_calls = self.extract_tool_calls(agent_session)
        analysis = {
            "success_rate": self.calculate_success_rate(tool_calls),
            "error_patterns": self.find_error_patterns(tool_calls),
            "parameter_issues": self.validate_parameters(tool_calls),
            "timing_issues": self.check_timing(tool_calls),
            "unexpected_calls": self.find_unexpected_calls(tool_calls),
        }
        return self.generate_recommendations(analysis)

    def validate_parameters(self, tool_calls):
        """Check for common parameter mistakes."""
        issues = []
        for call in tool_calls:
            # Check for missing required parameters
            required = set(call.tool_definition.required_params)
            provided = set(call.parameters.keys())
            if not required.issubset(provided):
                issues.append({
                    "type": "missing_required_param",
                    "tool": call.tool_name,
                    "missing": required - provided,
                })
            # Check for type mismatches
            for param, value in call.parameters.items():
                expected_type = call.tool_definition.param_types.get(param)
                if expected_type and not isinstance(value, expected_type):
                    issues.append({
                        "type": "type_mismatch",
                        "tool": call.tool_name,
                        "param": param,
                        "expected": expected_type.__name__,
                        "got": type(value).__name__,
                    })
        return issues
```
### Tool Call Red Flags

- 🚩 Same tool called 5+ times with identical parameters
- 🚩 Tools called out of logical sequence
- 🚩 Required parameters missing or wrong type
- 🚩 Tool results ignored (no follow-up actions)
- 🚩 Error responses not handled properly
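The first red flag is easy to check mechanically. A minimal sketch, assuming tool calls have already been extracted as `(tool_name, params_dict)` pairs (the extraction itself is session-log-specific and out of scope here):

```python
# Sketch: detect the same tool invoked threshold+ times with identical
# parameters. The `calls` format is an assumption, not an auditor API.
from collections import Counter

def find_repeated_calls(calls, threshold=5):
    """Return {(tool, frozen_params): count} for calls at or over threshold."""
    counts = Counter(
        (name, tuple(sorted(params.items()))) for name, params in calls
    )
    return {key: n for key, n in counts.items() if n >= threshold}

calls = [("search", {"q": "deploy status"})] * 6 + [("read_file", {"path": "a.py"})]
flagged = find_repeated_calls(calls)  # the six identical searches get flagged
```

Parameters are frozen into sorted tuples so that dicts with the same keys and values compare equal regardless of insertion order.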
## Technique 4: Memory Search Audit
Memory search failures are subtle but devastating. The agent gets the wrong context and makes decisions based on irrelevant or outdated information.
```python
# Memory search auditor (assumes the MemorySpine client from Technique 1)
from datetime import datetime


def audit_memory_searches(session_id, expected_memories=None):
    """Audit memory retrieval quality and relevance."""
    searches = memory.get_search_log(session_id)
    for search in searches:
        print(f"Query: {search.query}")
        print(f"Results: {len(search.results)} memories")

        # Check relevance scores
        low_relevance = [r for r in search.results if r.score < 0.7]
        if low_relevance:
            print(f"⚠️ {len(low_relevance)} low-relevance results")

        # Check for missing expected memories
        if expected_memories:
            retrieved_ids = {r.memory_id for r in search.results}
            expected_ids = {m.id for m in expected_memories}
            missing = expected_ids - retrieved_ids
            if missing:
                print(f"❌ Missing expected memories: {missing}")
                # Analyze why they weren't retrieved
                for mem_id in missing:
                    mem = memory.get_memory(mem_id)
                    similarity = calculate_similarity(search.query, mem.content)
                    print(f"  {mem_id}: similarity={similarity:.3f}")

        # Check temporal relevance
        old_memories = [r for r in search.results
                        if (datetime.now() - r.timestamp).days > 30]
        if old_memories and not search.include_old:
            print(f"⚠️ {len(old_memories)} memories older than 30 days")
        print("---")
```
### Memory Search Failure Modes
- Semantic mismatch: Query embeddings don’t match relevant content embeddings
- Temporal bias: Recent irrelevant memories outrank older relevant ones
- Tag filtering issues: Overly restrictive filters exclude relevant results
- Importance bias: Low-importance but relevant memories get excluded
- Context pollution: Agent context affects search query generation
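For the semantic-mismatch case specifically, it helps to compute the query-to-memory similarity by hand and compare it against your retrieval cutoff. A minimal sketch using plain cosine similarity; the example vectors are stand-ins for real embedding output:

```python
# Sketch: check whether a known-relevant memory can even clear the
# retrieval cutoff for a given query. Vectors here are illustrative.
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 for zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

query_vec = [0.1, 0.9, 0.2]   # stand-in embedding of the search query
memory_vec = [0.8, 0.1, 0.1]  # stand-in embedding of the missing memory
score = cosine_similarity(query_vec, memory_vec)
# If score sits below the 0.7 cutoff used above, this memory can never be
# retrieved for this query, no matter how relevant it actually is.
```

When the similarity is structurally too low, the fix is usually on the write side (store the memory with better phrasing or tags), not on the query side.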
## Technique 5: Context Window Visualization
Context window utilization affects agent performance. Too little context means missing information; too much buries important details in noise. Visualization helps you find the balance.
```python
# Context window analyzer
class ContextWindowAnalyzer:
    def visualize_token_usage(self, prompt, model="gpt-4"):
        """Show how tokens are distributed in the context window."""
        sections = self.parse_prompt_sections(prompt)
        token_counts = {
            section: self.count_tokens(content)
            for section, content in sections.items()
        }
        total_tokens = sum(token_counts.values())
        max_tokens = self.get_model_context_limit(model)

        print(f"Token Usage: {total_tokens:,} / {max_tokens:,} "
              f"({total_tokens / max_tokens:.1%})")
        print()
        for section, count in sorted(token_counts.items(),
                                     key=lambda x: x[1], reverse=True):
            percentage = count / total_tokens * 100
            bar = "█" * int(percentage // 2)
            print(f"{section:20} {count:6,} tokens {percentage:5.1f}% {bar}")

        # Identify issues
        issues = self.identify_issues(token_counts, total_tokens, max_tokens)
        if issues:
            print("\n⚠️ Issues detected:")
            for issue in issues:
                print(f"  {issue}")

    def identify_issues(self, token_counts, total, max_tokens):
        """Find token usage issues."""
        issues = []
        if total > max_tokens * 0.9:
            issues.append("Near context limit - may get truncated")
        if token_counts.get("system_prompt", 0) > max_tokens * 0.3:
            issues.append("System prompt too large")
        if token_counts.get("memory_context", 0) < max_tokens * 0.1:
            issues.append("Insufficient memory context")
        if token_counts.get("user_query", 0) > max_tokens * 0.4:
            issues.append("User query dominates context")
        return issues
```
## Technique 6: A/B Comparison Runs
When you suspect a change broke your agent, run the same inputs through different configurations to isolate the variable causing failure.
```python
# A/B comparison framework
class AgentComparison:
    def compare_configurations(self, test_inputs, config_a, config_b):
        """Run identical inputs through different agent configs."""
        results_a = []
        results_b = []
        for test_input in test_inputs:
            # Run with config A
            agent_a = self.create_agent(config_a)
            results_a.append(agent_a.process(test_input))
            # Run with config B
            agent_b = self.create_agent(config_b)
            results_b.append(agent_b.process(test_input))
        return self.analyze_differences(results_a, results_b)

    def analyze_differences(self, results_a, results_b):
        """Find systematic differences between configurations."""
        differences = []
        for i, (a, b) in enumerate(zip(results_a, results_b)):
            if a.output != b.output:
                differences.append({
                    "test_case": i,
                    "config_a_output": a.output,
                    "config_b_output": b.output,
                    "reasoning_a": a.reasoning,
                    "reasoning_b": b.reasoning,
                    "tool_calls_a": a.tool_calls,
                    "tool_calls_b": b.tool_calls,
                })
        # Look for patterns in the differences
        patterns = self.find_difference_patterns(differences)
        return {
            "total_differences": len(differences),
            "difference_rate": len(differences) / len(results_a),
            "patterns": patterns,
            "detailed_diffs": differences,
        }
```
### Common A/B Test Scenarios
- Model versions: GPT-4 turbo vs GPT-4o behavior changes
- Prompt modifications: New system prompt vs old version
- Memory configurations: Different retrieval algorithms
- Tool changes: Updated tool definitions or implementations
- Context strategies: Different context injection approaches
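The `find_difference_patterns` step in the framework is left abstract; one plausible sketch groups divergences by what actually changed, so you can see at a glance whether a config change flipped tool choices or only surface wording. The category names here are illustrative:

```python
# Sketch of a find_difference_patterns implementation. The diff dicts
# mirror the entries analyze_differences builds; the three buckets are
# one reasonable taxonomy, not the only one.
from collections import Counter

def find_difference_patterns(differences):
    """Count A/B divergences by category: tool choice, reasoning, or output-only."""
    categories = Counter()
    for diff in differences:
        if diff["tool_calls_a"] != diff["tool_calls_b"]:
            categories["different_tool_calls"] += 1
        elif diff["reasoning_a"] != diff["reasoning_b"]:
            categories["same_tools_different_reasoning"] += 1
        else:
            categories["output_only"] += 1
    return dict(categories)

diffs = [
    {"tool_calls_a": ["search"], "tool_calls_b": ["fetch"],
     "reasoning_a": "r1", "reasoning_b": "r2"},
    {"tool_calls_a": ["search"], "tool_calls_b": ["search"],
     "reasoning_a": "r1", "reasoning_b": "r1"},
]
patterns = find_difference_patterns(diffs)
```

A spike in `different_tool_calls` points at tool definitions or routing; `output_only` differences usually trace back to prompt or model-version changes.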
## Technique 7: Minimal Reproduction
Complex agent failures often have simple root causes. Strip away everything non-essential to create the smallest possible reproduction case.
```python
# Minimal reproduction generator
class MinimalReproduction:
    def create_minimal_repro(self, failure_case):
        """Reduce a failure case to its minimum essential elements."""
        # Start with the full failure case
        current_case = failure_case.copy()

        # Try removing whole element types one by one
        for element_type in ["memories", "tools", "context", "instructions"]:
            reduced_case = self.remove_element_type(current_case, element_type)
            if self.still_reproduces(reduced_case):
                current_case = reduced_case
                print(f"✓ Can remove {element_type}")
            else:
                print(f"✗ Need {element_type}")

        # Minimize the remaining elements within each type
        for element_type in list(current_case.keys()):
            current_case[element_type] = self.minimize_elements(
                current_case[element_type],
                # Test the full case with only this type substituted; the
                # default arg pins element_type against late binding
                lambda x, et=element_type: self.still_reproduces(
                    {**current_case, et: x}
                ),
            )
        return current_case

    def minimize_elements(self, elements, test_function):
        """Delta-debugging-style search for a minimal failing subset."""
        if len(elements) <= 1:
            return elements
        mid = len(elements) // 2
        # Try the first half alone
        if test_function(elements[:mid]):
            return self.minimize_elements(elements[:mid], test_function)
        # Try the second half alone
        if test_function(elements[mid:]):
            return self.minimize_elements(elements[mid:], test_function)
        # Need both halves; minimize each against the other
        first_min = self.minimize_elements(
            elements[:mid], lambda x: test_function(x + elements[mid:]))
        second_min = self.minimize_elements(
            elements[mid:], lambda x: test_function(first_min + x))
        return first_min + second_min
```
## Technique 8: Token Budget Analysis
Token starvation causes agents to ignore critical information. When important context gets truncated, agent behavior degrades silently.
```python
# Token budget analyzer
def analyze_token_budget(agent_session):
    """Analyze how the token budget affects agent performance."""
    interactions = agent_session.get_interactions()
    for i, interaction in enumerate(interactions):
        prompt_tokens = count_tokens(interaction.prompt)
        response_tokens = count_tokens(interaction.response)
        max_tokens = interaction.model_config.max_tokens
        utilization = prompt_tokens / max_tokens

        print(f"Interaction {i + 1}:")
        print(f"  Prompt tokens: {prompt_tokens:,}")
        print(f"  Response tokens: {response_tokens:,}")
        print(f"  Utilization: {utilization:.1%}")

        # Check for signs of truncation
        if utilization > 0.95:
            print("  ⚠️ Near token limit - likely truncation")

        # Analyze how tokens are distributed across prompt sections
        sections = parse_prompt_sections(interaction.prompt)
        print("  Token distribution:")
        for section, content in sections.items():
            section_tokens = count_tokens(content)
            percentage = section_tokens / prompt_tokens * 100
            print(f"    {section}: {section_tokens:,} ({percentage:.1f}%)")

        # Check whether critical info was truncated
        if interaction.truncated_content:
            print(f"  ❌ Truncated content: {interaction.truncated_content}")
        print()

    # Recommend optimizations
    return generate_token_optimizations(interactions)
```
## Technique 9: Prompt Injection Detection
Prompt injection attacks make agents behave unexpectedly. User input can override system instructions, causing security issues and bizarre behavior.
```python
# Prompt injection detector
import re


class PromptInjectionDetector:
    def __init__(self):
        self.injection_patterns = [
            r"ignore.+previous.+instructions",
            r"forget.+everything",
            r"you.+are.+now",
            r"system.+message",
            r"new.+instructions",
            r"override.+settings",
        ]

    def scan_for_injection(self, user_input):
        """Detect potential prompt injection attempts."""
        findings = []
        # Pattern-based detection
        for pattern in self.injection_patterns:
            if re.search(pattern, user_input, re.IGNORECASE):
                findings.append({
                    "type": "pattern_match",
                    "pattern": pattern,
                    "severity": "high",
                })
        # Semantic analysis
        if self.contains_instruction_language(user_input):
            findings.append({
                "type": "instruction_language",
                "severity": "medium",
            })
        # Check for system keyword density
        system_keywords = ["system", "assistant", "user", "prompt", "instruction"]
        keyword_count = sum(1 for word in system_keywords
                            if word in user_input.lower())
        if keyword_count >= 3:
            findings.append({
                "type": "keyword_density",
                "keyword_count": keyword_count,
                "severity": "medium",
            })
        return findings

    def analyze_behavior_change(self, baseline_behavior, current_behavior):
        """Detect whether agent behavior has been hijacked."""
        changes = []
        # Check for dramatic personality shifts
        if self.personality_similarity(baseline_behavior, current_behavior) < 0.7:
            changes.append("personality_shift")
        # Check for instruction-following breakdown
        if current_behavior.follows_system_instructions < 0.8:
            changes.append("instruction_breakdown")
        # Check for unexpected tool usage
        if set(current_behavior.tools_used) != set(baseline_behavior.tools_used):
            changes.append("tool_usage_change")
        return changes
```
## Technique 10: Statistical Anomaly Detection
Some agent failures are statistical outliers — behaviors that fall outside normal operating parameters. Automated anomaly detection catches these edge cases.
```python
# Statistical anomaly detector
import numpy as np


class AgentAnomalyDetector:
    def __init__(self, baseline_data):
        self.baseline = self.calculate_baseline_stats(baseline_data)

    def detect_anomalies(self, current_session):
        """Find statistical anomalies in agent behavior."""
        current_stats = self.calculate_session_stats(current_session)
        anomalies = []
        for metric, value in current_stats.items():
            baseline_mean = self.baseline[metric]["mean"]
            baseline_std = self.baseline[metric]["std"]
            # Skip zero-variance metrics rather than divide by zero
            if baseline_std == 0:
                continue
            z_score = abs(value - baseline_mean) / baseline_std
            if z_score > 3.0:  # beyond 3 standard deviations
                anomalies.append({
                    "metric": metric,
                    "current_value": value,
                    "baseline_mean": baseline_mean,
                    "z_score": z_score,
                    "severity": "high" if z_score > 5 else "medium",
                })
        return anomalies

    def calculate_session_stats(self, session):
        """Calculate behavioral statistics for a session."""
        responses = session.responses
        return {
            "response_length_avg": np.mean([len(r.content) for r in responses]),
            "tool_calls_per_response": len(session.tool_calls) / len(responses),
            "memory_searches_per_response": (
                len(session.memory_searches) / len(responses)),
            "confidence_score_avg": np.mean([r.confidence for r in responses]),
            "processing_time_avg": np.mean(
                [r.processing_time for r in responses]),
            "error_rate": session.error_count / session.total_operations,
            "context_utilization_avg": np.mean(
                [r.context_tokens / r.max_tokens for r in responses]),
        }
```
Track these key metrics when establishing a baseline:

- Response length and confidence scores
- Tool usage patterns and error rates
- Memory search frequency and results
- Processing time and token utilization
- Decision confidence and reasoning quality
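The detector above assumes a `calculate_baseline_stats` step. A minimal sketch of that step, computing per-metric mean and standard deviation over historical session stats; the zero-std fallback is a pragmatic assumption to keep z-scores computable, not a statistical recommendation:

```python
# Sketch: build {metric: {"mean": ..., "std": ...}} from past sessions.
# `sessions` is a list of stat dicts shaped like calculate_session_stats
# returns, one per historical session.
import statistics

def calculate_baseline_stats(sessions):
    """Per-metric mean and population std over historical session stats."""
    baseline = {}
    for metric in sessions[0]:
        values = [s[metric] for s in sessions]
        baseline[metric] = {
            "mean": statistics.mean(values),
            # fall back to a tiny std so a flat baseline can't divide by zero
            "std": statistics.pstdev(values) or 1e-9,
        }
    return baseline

history = [{"error_rate": 0.02}, {"error_rate": 0.04}, {"error_rate": 0.03}]
baseline = calculate_baseline_stats(history)
```

In practice the baseline should come from many healthy sessions spread over time, so that normal weekly variation doesn't register as an anomaly.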
The debugging arsenal is complete. These 10 techniques cover the most common AI agent failure modes. Start with memory replay for stateful issues, use A/B comparison for recent changes, and employ anomaly detection for mysterious edge cases.
Remember: AI agents fail differently than traditional software. The bugs are emergent, context-dependent, and often invisible in logs. These techniques give you visibility into the black box, turning mysterious failures into debuggable problems.
## Debug Your Agents with Memory Spine
Memory timeline, tool call tracing, and anomaly detection built-in. Find bugs faster with complete agent observability.
Start Debugging →