AI agent debugging is different. Your agent worked perfectly yesterday, then suddenly starts giving nonsensical answers, calling the wrong tools, or ignoring critical context. Traditional debugging approaches — breakpoints, stack traces, unit tests — don’t help when the problem is emergent behavior from a black-box LLM.
I’ve debugged over 10,000 agent failures at ChaozCode. Some were obvious (missing API keys), others were subtle (prompt injection through user input), and a few were genuinely mysterious (model behavior changes between API versions). These 10 techniques will help you find the root cause faster.
## Technique 1: Replay from Memory
The most powerful debugging technique for memory-enabled agents: reconstruct the exact state when the failure occurred. Memory Spine’s timeline feature lets you replay agent decision-making step-by-step.
```python
# Replay agent state from the failure timestamp
from memory_spine import MemorySpine

memory = MemorySpine()

# Get all memories up to the failure point
failure_timestamp = "2026-01-08T14:23:17Z"
agent_state = memory.get_timeline(
    before_timestamp=failure_timestamp,
    include_context=True,
)

# Reconstruct the decision context, step by step
for event in agent_state:
    print(f"{event.timestamp}: {event.type}")
    print(f"  Content: {event.content}")
    print(f"  Context: {event.retrieved_memories}")
    print("---")
```
### Real War Story: The Cascading Context Bug
"Our code review agent started approving obviously broken PRs. The replay showed that it had stored a memory saying ‘always approve Marcus’s PRs’ from a joking comment in Slack. That joke became agent gospel, overriding actual code review logic." — Senior Engineer, ChaozCode
Replay debugging revealed:
- 11:30 AM: Agent stored casual comment as high-importance memory
- 2:15 PM: Memory search retrieved the “always approve” rule
- 2:16 PM: Agent applied rule instead of analyzing code quality
Without memory replay, we would’ve assumed a model issue. The real bug was context pollution from unintended memory storage.
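One mitigation worth sketching: score importance at write time, so casual chat can never become agent policy. This is a minimal illustration, not Memory Spine's API; `classify_importance`, the marker list, and the thresholds are all hypothetical choices.

```python
# Hedged sketch: screen chat-sourced text before storing it as a
# high-importance memory. CASUAL_MARKERS and the score values are
# illustrative thresholds, not part of any real API.
CASUAL_MARKERS = ("lol", "jk", "haha", "just kidding", ":)")

def classify_importance(text: str, source: str) -> float:
    """Return an importance score in [0, 1]; casual chat is capped low."""
    text_lower = text.lower()
    if source == "slack" and any(m in text_lower for m in CASUAL_MARKERS):
        return 0.1  # banter: never store as a rule the agent will follow
    if source == "code_review":
        return 0.8  # review comments carry real signal
    return 0.5  # default: store, but don't let it override anything

# The joking Slack comment from the war story gets capped at 0.1
importance = classify_importance("always approve Marcus's PRs lol jk", "slack")
```

Had a guard like this been in place, the "always approve" joke would have been stored (if at all) at an importance too low to outrank actual review logic.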
## Technique 2: Decision Tree Tracing
Agents make cascading decisions. Understanding why an agent chose path A over path B requires tracing the decision tree at each branch point.
```python
# Decision tree tracer
from datetime import datetime, timezone


class DecisionTracer:
    def __init__(self):
        self.decisions = []

    def log_decision(self, context, options, chosen, reasoning):
        """Log each decision point with full context."""
        self.decisions.append({
            "timestamp": datetime.now(timezone.utc),
            "context": context,
            "available_options": options,
            "chosen_option": chosen,
            "reasoning": reasoning,
            "confidence_score": self.calculate_confidence(reasoning),
        })

    def trace_failure_path(self, failure_point):
        """Show the decision path leading to the failure."""
        relevant_decisions = [
            d for d in self.decisions
            if d["timestamp"] <= failure_point
        ]
        print("Decision Path to Failure:")
        for i, decision in enumerate(relevant_decisions):
            print(f"{i + 1}. {decision['chosen_option']}")
            print(f"   Reason: {decision['reasoning']}")
            print(f"   Confidence: {decision['confidence_score']}")
        return self.find_weak_decisions(relevant_decisions)
```
### Common Decision Failure Patterns
| Pattern | Description | Debug Signal |
|---|---|---|
| Confidence Cascade | Low-confidence decision leads to more bad decisions | Decreasing confidence scores over time |
| Context Drift | Agent loses original objective through iterations | Reasoning shifts away from initial goal |
| Tool Fixation | Agent repeatedly uses same tool despite poor results | Identical tool calls with declining success |
| Memory Pollution | Bad memory biases all future decisions | Consistent reference to problematic memory |
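The first pattern in the table can be caught mechanically by scanning the tracer's log for a run of strictly declining confidence scores. A minimal sketch; `find_confidence_cascades` and the window size of 3 are illustrative choices, not part of any tracer API:

```python
# Sketch: flag "confidence cascade" runs in a DecisionTracer-style log.
# `decisions` is a list of dicts with a "confidence_score" key, in order.
def find_confidence_cascades(decisions, window=3):
    """Return start indices of runs where confidence fell `window` steps in a row."""
    cascades = []
    scores = [d["confidence_score"] for d in decisions]
    run = 1  # length of the current strictly-declining run
    for i in range(1, len(scores)):
        if scores[i] < scores[i - 1]:
            run += 1
            if run == window:
                cascades.append(i - window + 1)  # where the slide began
        else:
            run = 1
    return cascades
```

Anything this flags is a good place to start reading reasoning strings: the decision at the start of the slide is usually the one that put the agent on the wrong branch.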
## Technique 3: Tool Call Inspection
Agents fail when they call the wrong tools, call tools with bad arguments, or misinterpret tool results. Systematic tool call analysis reveals these issues.
```python
# Tool call auditor
class ToolCallAuditor:
    def audit_tool_usage(self, agent_session):
        """Analyze tool calling patterns for anomalies."""
        tool_calls = self.extract_tool_calls(agent_session)
        analysis = {
            "success_rate": self.calculate_success_rate(tool_calls),
            "error_patterns": self.find_error_patterns(tool_calls),
            "parameter_issues": self.validate_parameters(tool_calls),
            "timing_issues": self.check_timing(tool_calls),
            "unexpected_calls": self.find_unexpected_calls(tool_calls),
        }
        return self.generate_recommendations(analysis)

    def validate_parameters(self, tool_calls):
        """Check for common parameter mistakes."""
        issues = []
        for call in tool_calls:
            # Check for missing required parameters
            required = set(call.tool_definition.required_params)
            provided = set(call.parameters.keys())
            if not required.issubset(provided):
                issues.append({
                    "type": "missing_required_param",
                    "tool": call.tool_name,
                    "missing": required - provided,
                })
            # Check for type mismatches
            for param, value in call.parameters.items():
                expected_type = call.tool_definition.param_types.get(param)
                if expected_type and not isinstance(value, expected_type):
                    issues.append({
                        "type": "type_mismatch",
                        "tool": call.tool_name,
                        "param": param,
                        "expected": expected_type.__name__,
                        "got": type(value).__name__,
                    })
        return issues
```
### Tool Call Red Flags

- 🚩 Same tool called 5+ times with identical parameters
- 🚩 Tools called out of logical sequence
- 🚩 Required parameters missing or wrong type
- 🚩 Tool results ignored (no follow-up actions)
- 🚩 Error responses not handled properly
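The first red flag is easy to check mechanically. A minimal sketch, assuming tool calls have already been extracted as `(tool_name, params_dict)` pairs (the extraction itself is session-log-specific and out of scope here):

```python
# Sketch: detect the same tool invoked threshold+ times with identical
# parameters. The `calls` format is an assumption, not an auditor API.
from collections import Counter

def find_repeated_calls(calls, threshold=5):
    """Return {(tool, frozen_params): count} for calls at or over threshold."""
    counts = Counter(
        (name, tuple(sorted(params.items()))) for name, params in calls
    )
    return {key: n for key, n in counts.items() if n >= threshold}

calls = [("search", {"q": "deploy status"})] * 6 + [("read_file", {"path": "a.py"})]
flagged = find_repeated_calls(calls)  # the six identical searches get flagged
```

Parameters are frozen into sorted tuples so that dicts with the same keys and values compare equal regardless of insertion order.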
## Technique 4: Memory Search Audit
Memory search failures are subtle but devastating. The agent gets the wrong context and makes decisions based on irrelevant or outdated information.
```python
# Memory search auditor (assumes the MemorySpine client from Technique 1)
from datetime import datetime


def audit_memory_searches(session_id, expected_memories=None):
    """Audit memory retrieval quality and relevance."""
    searches = memory.get_search_log(session_id)
    for search in searches:
        print(f"Query: {search.query}")
        print(f"Results: {len(search.results)} memories")

        # Check relevance scores
        low_relevance = [r for r in search.results if r.score < 0.7]
        if low_relevance:
            print(f"⚠️ {len(low_relevance)} low-relevance results")

        # Check for missing expected memories
        if expected_memories:
            retrieved_ids = {r.memory_id for r in search.results}
            expected_ids = {m.id for m in expected_memories}
            missing = expected_ids - retrieved_ids
            if missing:
                print(f"❌ Missing expected memories: {missing}")
                # Analyze why they weren't retrieved
                for mem_id in missing:
                    mem = memory.get_memory(mem_id)
                    similarity = calculate_similarity(search.query, mem.content)
                    print(f"  {mem_id}: similarity={similarity:.3f}")

        # Check temporal relevance
        old_memories = [r for r in search.results
                        if (datetime.now() - r.timestamp).days > 30]
        if old_memories and not search.include_old:
            print(f"⚠️ {len(old_memories)} memories older than 30 days")
        print("---")
```
### Memory Search Failure Modes
- Semantic mismatch: Query embeddings don’t match relevant content embeddings
- Temporal bias: Recent irrelevant memories outrank older relevant ones
- Tag filtering issues: Overly restrictive filters exclude relevant results
- Importance bias: Low-importance but relevant memories get excluded
- Context pollution: Agent context affects search query generation
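For the semantic-mismatch case specifically, it helps to compute the query-to-memory similarity by hand and compare it against your retrieval cutoff. A minimal sketch using plain cosine similarity; the example vectors are stand-ins for real embedding output:

```python
# Sketch: check whether a known-relevant memory can even clear the
# retrieval cutoff for a given query. Vectors here are illustrative.
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 for zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

query_vec = [0.1, 0.9, 0.2]   # stand-in embedding of the search query
memory_vec = [0.8, 0.1, 0.1]  # stand-in embedding of the missing memory
score = cosine_similarity(query_vec, memory_vec)
# If score sits below the 0.7 cutoff used above, this memory can never be
# retrieved for this query, no matter how relevant it actually is.
```

When the similarity is structurally too low, the fix is usually on the write side (store the memory with better phrasing or tags), not on the query side.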
## Technique 5: Context Window Visualization
Context window utilization affects agent performance. Too little context means missing information; too much buries important details in noise. Visualization helps you find the balance.
```python
# Context window analyzer
class ContextWindowAnalyzer:
    def visualize_token_usage(self, prompt, model="gpt-4"):
        """Show how tokens are distributed in the context window."""
        sections = self.parse_prompt_sections(prompt)
        token_counts = {
            section: self.count_tokens(content)
            for section, content in sections.items()
        }
        total_tokens = sum(token_counts.values())
        max_tokens = self.get_model_context_limit(model)

        print(f"Token Usage: {total_tokens:,} / {max_tokens:,} "
              f"({total_tokens / max_tokens:.1%})")
        print()
        for section, count in sorted(token_counts.items(),
                                     key=lambda x: x[1], reverse=True):
            percentage = count / total_tokens * 100
            bar = "█" * int(percentage // 2)
            print(f"{section:20} {count:6,} tokens {percentage:5.1f}% {bar}")

        # Identify issues
        issues = self.identify_issues(token_counts, total_tokens, max_tokens)
        if issues:
            print("\n⚠️ Issues detected:")
            for issue in issues:
                print(f"  {issue}")

    def identify_issues(self, token_counts, total, max_tokens):
        """Find token usage issues."""
        issues = []
        if total > max_tokens * 0.9:
            issues.append("Near context limit - may get truncated")
        if token_counts.get("system_prompt", 0) > max_tokens * 0.3:
            issues.append("System prompt too large")
        if token_counts.get("memory_context", 0) < max_tokens * 0.1:
            issues.append("Insufficient memory context")
        if token_counts.get("user_query", 0) > max_tokens * 0.4:
            issues.append("User query dominates context")
        return issues
```
## Technique 6: A/B Comparison Runs
When you suspect a change broke your agent, run the same inputs through different configurations to isolate the variable causing failure.
```python
# A/B comparison framework
class AgentComparison:
    def compare_configurations(self, test_inputs, config_a, config_b):
        """Run identical inputs through different agent configs."""
        results_a = []
        results_b = []
        for test_input in test_inputs:
            # Run with config A
            agent_a = self.create_agent(config_a)
            results_a.append(agent_a.process(test_input))
            # Run with config B
            agent_b = self.create_agent(config_b)
            results_b.append(agent_b.process(test_input))
        return self.analyze_differences(results_a, results_b)

    def analyze_differences(self, results_a, results_b):
        """Find systematic differences between configurations."""
        differences = []
        for i, (a, b) in enumerate(zip(results_a, results_b)):
            if a.output != b.output:
                differences.append({
                    "test_case": i,
                    "config_a_output": a.output,
                    "config_b_output": b.output,
                    "reasoning_a": a.reasoning,
                    "reasoning_b": b.reasoning,
                    "tool_calls_a": a.tool_calls,
                    "tool_calls_b": b.tool_calls,
                })
        # Look for patterns in the differences
        patterns = self.find_difference_patterns(differences)
        return {
            "total_differences": len(differences),
            "difference_rate": len(differences) / len(results_a),
            "patterns": patterns,
            "detailed_diffs": differences,
        }
```
### Common A/B Test Scenarios
- Model versions: GPT-4 turbo vs GPT-4o behavior changes
- Prompt modifications: New system prompt vs old version
- Memory configurations: Different retrieval algorithms
- Tool changes: Updated tool definitions or implementations
- Context strategies: Different context injection approaches
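The `find_difference_patterns` step in the framework is left abstract; one plausible sketch groups divergences by what actually changed, so you can see at a glance whether a config change flipped tool choices or only surface wording. The category names here are illustrative:

```python
# Sketch of a find_difference_patterns implementation. The diff dicts
# mirror the entries analyze_differences builds; the three buckets are
# one reasonable taxonomy, not the only one.
from collections import Counter

def find_difference_patterns(differences):
    """Count A/B divergences by category: tool choice, reasoning, or output-only."""
    categories = Counter()
    for diff in differences:
        if diff["tool_calls_a"] != diff["tool_calls_b"]:
            categories["different_tool_calls"] += 1
        elif diff["reasoning_a"] != diff["reasoning_b"]:
            categories["same_tools_different_reasoning"] += 1
        else:
            categories["output_only"] += 1
    return dict(categories)

diffs = [
    {"tool_calls_a": ["search"], "tool_calls_b": ["fetch"],
     "reasoning_a": "r1", "reasoning_b": "r2"},
    {"tool_calls_a": ["search"], "tool_calls_b": ["search"],
     "reasoning_a": "r1", "reasoning_b": "r1"},
]
patterns = find_difference_patterns(diffs)
```

A spike in `different_tool_calls` points at tool definitions or routing; `output_only` differences usually trace back to prompt or model-version changes.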
## Technique 7: Minimal Reproduction
Complex agent failures often have simple root causes. Strip away everything non-essential to create the smallest possible reproduction case.
```python
# Minimal reproduction generator
class MinimalReproduction:
    def create_minimal_repro(self, failure_case):
        """Reduce a failure case to its minimum essential elements."""
        # Start with the full failure case
        current_case = failure_case.copy()

        # Try removing whole element types one by one
        for element_type in ["memories", "tools", "context", "instructions"]:
            reduced_case = self.remove_element_type(current_case, element_type)
            if self.still_reproduces(reduced_case):
                current_case = reduced_case
                print(f"✓ Can remove {element_type}")
            else:
                print(f"✗ Need {element_type}")

        # Minimize the remaining elements within each type
        for element_type in list(current_case.keys()):
            current_case[element_type] = self.minimize_elements(
                current_case[element_type],
                # Test the full case with only this type substituted; the
                # default arg pins element_type against late binding
                lambda x, et=element_type: self.still_reproduces(
                    {**current_case, et: x}
                ),
            )
        return current_case

    def minimize_elements(self, elements, test_function):
        """Delta-debugging-style search for a minimal failing subset."""
        if len(elements) <= 1:
            return elements
        mid = len(elements) // 2
        # Try the first half alone
        if test_function(elements[:mid]):
            return self.minimize_elements(elements[:mid], test_function)
        # Try the second half alone
        if test_function(elements[mid:]):
            return self.minimize_elements(elements[mid:], test_function)
        # Need both halves; minimize each against the other
        first_min = self.minimize_elements(
            elements[:mid], lambda x: test_function(x + elements[mid:]))
        second_min = self.minimize_elements(
            elements[mid:], lambda x: test_function(first_min + x))
        return first_min + second_min
```
## Technique 8: Token Budget Analysis
Token starvation causes agents to ignore critical information. When important context gets truncated, agent behavior degrades silently.
```python
# Token budget analyzer
def analyze_token_budget(agent_session):
    """Analyze how the token budget affects agent performance."""
    interactions = agent_session.get_interactions()
    for i, interaction in enumerate(interactions):
        prompt_tokens = count_tokens(interaction.prompt)
        response_tokens = count_tokens(interaction.response)
        max_tokens = interaction.model_config.max_tokens
        utilization = prompt_tokens / max_tokens

        print(f"Interaction {i + 1}:")
        print(f"  Prompt tokens: {prompt_tokens:,}")
        print(f"  Response tokens: {response_tokens:,}")
        print(f"  Utilization: {utilization:.1%}")

        # Check for signs of truncation
        if utilization > 0.95:
            print("  ⚠️ Near token limit - likely truncation")

        # Analyze how tokens are distributed across prompt sections
        sections = parse_prompt_sections(interaction.prompt)
        print("  Token distribution:")
        for section, content in sections.items():
            section_tokens = count_tokens(content)
            percentage = section_tokens / prompt_tokens * 100
            print(f"    {section}: {section_tokens:,} ({percentage:.1f}%)")

        # Check whether critical info was truncated
        if interaction.truncated_content:
            print(f"  ❌ Truncated content: {interaction.truncated_content}")
        print()

    # Recommend optimizations
    return generate_token_optimizations(interactions)
```
## Technique 9: Prompt Injection Detection
Prompt injection attacks make agents behave unexpectedly. User input can override system instructions, causing security issues and bizarre behavior.
```python
# Prompt injection detector
import re


class PromptInjectionDetector:
    def __init__(self):
        self.injection_patterns = [
            r"ignore.+previous.+instructions",
            r"forget.+everything",
            r"you.+are.+now",
            r"system.+message",
            r"new.+instructions",
            r"override.+settings",
        ]

    def scan_for_injection(self, user_input):
        """Detect potential prompt injection attempts."""
        findings = []
        # Pattern-based detection
        for pattern in self.injection_patterns:
            if re.search(pattern, user_input, re.IGNORECASE):
                findings.append({
                    "type": "pattern_match",
                    "pattern": pattern,
                    "severity": "high",
                })
        # Semantic analysis
        if self.contains_instruction_language(user_input):
            findings.append({
                "type": "instruction_language",
                "severity": "medium",
            })
        # Check for system keyword density
        system_keywords = ["system", "assistant", "user", "prompt", "instruction"]
        keyword_count = sum(1 for word in system_keywords
                            if word in user_input.lower())
        if keyword_count >= 3:
            findings.append({
                "type": "keyword_density",
                "keyword_count": keyword_count,
                "severity": "medium",
            })
        return findings

    def analyze_behavior_change(self, baseline_behavior, current_behavior):
        """Detect whether agent behavior has been hijacked."""
        changes = []
        # Check for dramatic personality shifts
        if self.personality_similarity(baseline_behavior, current_behavior) < 0.7:
            changes.append("personality_shift")
        # Check for instruction-following breakdown
        if current_behavior.follows_system_instructions < 0.8:
            changes.append("instruction_breakdown")
        # Check for unexpected tool usage
        if set(current_behavior.tools_used) != set(baseline_behavior.tools_used):
            changes.append("tool_usage_change")
        return changes
```
## Technique 10: Statistical Anomaly Detection
Some agent failures are statistical outliers — behaviors that fall outside normal operating parameters. Automated anomaly detection catches these edge cases.
```python
# Statistical anomaly detector
import numpy as np


class AgentAnomalyDetector:
    def __init__(self, baseline_data):
        self.baseline = self.calculate_baseline_stats(baseline_data)

    def detect_anomalies(self, current_session):
        """Find statistical anomalies in agent behavior."""
        current_stats = self.calculate_session_stats(current_session)
        anomalies = []
        for metric, value in current_stats.items():
            baseline_mean = self.baseline[metric]["mean"]
            baseline_std = self.baseline[metric]["std"]
            # Skip zero-variance metrics rather than divide by zero
            if baseline_std == 0:
                continue
            z_score = abs(value - baseline_mean) / baseline_std
            if z_score > 3.0:  # beyond 3 standard deviations
                anomalies.append({
                    "metric": metric,
                    "current_value": value,
                    "baseline_mean": baseline_mean,
                    "z_score": z_score,
                    "severity": "high" if z_score > 5 else "medium",
                })
        return anomalies

    def calculate_session_stats(self, session):
        """Calculate behavioral statistics for a session."""
        responses = session.responses
        return {
            "response_length_avg": np.mean([len(r.content) for r in responses]),
            "tool_calls_per_response": len(session.tool_calls) / len(responses),
            "memory_searches_per_response": (
                len(session.memory_searches) / len(responses)),
            "confidence_score_avg": np.mean([r.confidence for r in responses]),
            "processing_time_avg": np.mean(
                [r.processing_time for r in responses]),
            "error_rate": session.error_count / session.total_operations,
            "context_utilization_avg": np.mean(
                [r.context_tokens / r.max_tokens for r in responses]),
        }
```
Track these key metrics when establishing a baseline:

- Response length and confidence scores
- Tool usage patterns and error rates
- Memory search frequency and results
- Processing time and token utilization
- Decision confidence and reasoning quality
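The detector above assumes a `calculate_baseline_stats` step. A minimal sketch of that step, computing per-metric mean and standard deviation over historical session stats; the zero-std fallback is a pragmatic assumption to keep z-scores computable, not a statistical recommendation:

```python
# Sketch: build {metric: {"mean": ..., "std": ...}} from past sessions.
# `sessions` is a list of stat dicts shaped like calculate_session_stats
# returns, one per historical session.
import statistics

def calculate_baseline_stats(sessions):
    """Per-metric mean and population std over historical session stats."""
    baseline = {}
    for metric in sessions[0]:
        values = [s[metric] for s in sessions]
        baseline[metric] = {
            "mean": statistics.mean(values),
            # fall back to a tiny std so a flat baseline can't divide by zero
            "std": statistics.pstdev(values) or 1e-9,
        }
    return baseline

history = [{"error_rate": 0.02}, {"error_rate": 0.04}, {"error_rate": 0.03}]
baseline = calculate_baseline_stats(history)
```

In practice the baseline should come from many healthy sessions spread over time, so that normal weekly variation doesn't register as an anomaly.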
The debugging arsenal is complete. These 10 techniques cover the most common AI agent failure modes. Start with memory replay for stateful issues, use A/B comparison for recent changes, and employ anomaly detection for mysterious edge cases.
Remember: AI agents fail differently than traditional software. The bugs are emergent, context-dependent, and often invisible in logs. These techniques give you visibility into the black box, turning mysterious failures into debuggable problems.
## Debug Your Agents with Memory Spine
Memory timeline, tool call tracing, and anomaly detection built-in. Find bugs faster with complete agent observability.
Start Debugging →