1. The Context Injection Pattern
Most prompt engineering guides focus on instruction design — how to ask the LLM to do things. But the bigger quality lever is context injection — what information you provide before asking.
Here’s the pattern that changed everything for our agents:
```python
# Standard prompt (no memory)
"Review this pull request and suggest improvements."

# Memory-augmented prompt
"""
## Context
Based on our recent discussions:
- You prefer functional programming patterns
- The team follows strict error handling guidelines
- Performance is critical for this microservice

## Recent Decisions
- PR #847: Decided against using async/await for this component
- Meeting 1/15: Agreed to prioritize readability over brevity
- Code review 1/18: Established 90ms latency budget

## Review Request
Review this pull request and suggest improvements:
[PR content]
"""
```
The difference? Context transforms generic feedback into specific, actionable guidance. Instead of “consider error handling,” you get “this violates our established error handling pattern — wrap network calls in Result<T> as discussed in PR #847.”
Memory-augmented prompts consistently deliver:
- 40% better relevance scores (measured by human raters)
- 60% fewer follow-up clarifications needed
- 25% reduction in back-and-forth iterations
- Consistent voice and preferences across sessions
The key insight: LLMs are incredibly good at synthesis when given the right inputs. The limitation isn’t model capacity — it’s context poverty.
2. Technique 1: Pinned System Context
Pinned context is your agent’s persistent personality and core knowledge. Think of it as the ~/.bashrc of AI memory — configuration that loads with every session.
What to Pin
- Preferences: Code style, communication tone, decision-making criteria
- Domain knowledge: Company-specific concepts, architecture patterns
- Guardrails: Things to avoid, policies to follow
- Context shortcuts: Abbreviations and internal terminology
```python
# Memory Spine implementation
from memory_spine import MemorySpine

memory = MemorySpine()

# Pin persistent agent context
memory.pin_memory(
    key="system_context",
    content="""
## Agent Personality
- Direct, technical communication
- Prefer working code examples over theory
- Ask clarifying questions when requirements are ambiguous
- Flag security issues immediately

## Our Architecture
- Microservices: Auth, API Gateway, User Service, Analytics
- Stack: Python/FastAPI, PostgreSQL, Redis, Kubernetes
- Code style: Black formatting, type hints required
- Testing: pytest, 90% coverage target

## Current Projects
- Q1 2026: User service performance optimization
- Ongoing: Migration from REST to GraphQL
- Security: SOC2 compliance implementation
""",
    importance=10.0,  # Highest importance = always retrieved
)
```
Now every prompt automatically includes this context. Your agent starts each conversation already knowing your preferences, architecture, and current focus areas.
Dynamic Pinning
Pinned context can evolve. When priorities shift or new decisions are made, update the pinned memory:
```python
# Update pinned context when priorities change
memory.update_pin(
    key="system_context",
    append="""
## Updated Priorities (Jan 2026)
- Performance optimization postponed to Q2
- New focus: Security hardening for SOC2 audit
- Freeze on new feature development until audit complete
""",
)
```
Keep pinned context under 500 tokens to preserve budget for dynamic context. Update pins weekly or when major decisions change. Pin 3-5 key areas maximum — too much pinning dilutes focus.
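The 500-token budget above is easy to enforce programmatically. Here is a minimal sketch; the `PIN_TOKEN_BUDGET` constant, the helper names, and the 4-characters-per-token heuristic are illustrative assumptions, not part of the Memory Spine API:

```python
# Rough guard to keep pinned context within the 500-token budget.
# The 4-chars-per-token heuristic approximates English prose; swap in a
# real tokenizer (e.g. tiktoken) for production use.
PIN_TOKEN_BUDGET = 500

def estimate_tokens(text: str) -> int:
    """Cheap approximation: ~4 characters per token."""
    return max(1, len(text) // 4)

def check_pin_budget(pinned_content: str) -> bool:
    """Return True if the pinned context fits the token budget."""
    return estimate_tokens(pinned_content) <= PIN_TOKEN_BUDGET
```

Running this check before each `update_pin` call catches budget creep early, before an oversized pin starts crowding out dynamic context.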
3. Technique 2: Recent Conversation Summary
Long conversations exceed context windows, but you don’t want to lose the thread. The solution: progressive summarization of recent exchanges.
```python
# Automatic conversation summarization
def summarize_conversation(messages, max_tokens=1000):
    """Compress conversation history into key decisions and context."""
    # Extract key decisions, action items, and context
    summary_prompt = f"""
Summarize this conversation, focusing on:
- Decisions made and rationale
- Action items and next steps
- Technical details that affect future work
- User preferences expressed

Conversation: {messages}

Format as bullet points, most recent items first.
"""
    summary = llm.generate(summary_prompt)

    # Store in memory for future reference
    memory.store_memory(
        content=summary,
        tags=["conversation_summary", "recent_context"],
        importance=7.0,
    )
    return summary
```
The summarization happens automatically when conversations reach ~75% of context window capacity. Recent summaries get injected into new prompts to maintain conversational continuity.
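The ~75% trigger can be expressed as a simple predicate. A minimal sketch, assuming a 128K-token window and the same rough 4-characters-per-token heuristic; both constants are illustrative, not Memory Spine defaults:

```python
# Sketch of the 75%-of-context-window trigger described above.
CONTEXT_WINDOW = 128_000        # assumed model window, in tokens
SUMMARIZE_THRESHOLD = 0.75      # summarize once 75% of the window is used

def should_summarize(messages: list[str]) -> bool:
    """Trigger summarization once the conversation nears window capacity."""
    used = sum(len(m) // 4 for m in messages)  # ~4 chars per token
    return used >= CONTEXT_WINDOW * SUMMARIZE_THRESHOLD
```

Checking this predicate after every exchange keeps summarization proactive, so the window never overflows mid-conversation.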
Before/After Comparison
| Without Summary | With Summary |
|---|---|
| "I need help with authentication." | "Continuing from yesterday’s session: implementing JWT refresh tokens for the auth service..." |
| "Can you review this API design?" | "Based on our GraphQL migration plan and the performance requirements we discussed..." |
| "There’s a bug in the user service." | "Another issue with the user service? Given the connection pooling problems we debugged last week..." |
The agent maintains context across session boundaries, building on previous work rather than starting fresh each time.
4. Technique 3: Relevant Memory Search
This is the core value of persistent memory: automatically surfacing relevant context based on the current query. No manual lookup required.
```python
# Semantic memory retrieval
def inject_relevant_context(query, max_memories=5):
    """Find and inject relevant memories into prompt context."""
    # Search for semantically similar memories
    relevant_memories = memory.search_memories(
        query=query,
        limit=max_memories,
        min_relevance=0.7,  # Only high-relevance matches
        recency_boost=0.3,  # Prefer recent memories
    )
    if not relevant_memories:
        return query

    # Format context for injection
    context = "## Relevant Context\n"
    for mem in relevant_memories:
        context += f"- {mem.timestamp}: {mem.content}\n"

    # Inject before the main query
    return f"{context}\n## Current Request\n{query}"
```
Smart Filtering
Raw semantic search often returns too much irrelevant content. Memory Spine’s filtering improves precision:
- Relevance threshold: Only memories with 70%+ similarity scores
- Recency boost: Weight recent memories 30% higher
- Diversity filtering: Avoid returning multiple similar memories
- Importance weighting: High-importance memories surface more easily
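Diversity filtering is the least obvious of these, so here is a minimal sketch. It assumes each candidate arrives as a `(content, embedding)` pair already sorted by relevance; the threshold, input shape, and function names are illustrative, not Memory Spine's API:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def diversity_filter(scored_memories, max_similarity=0.9):
    """Greedily keep memories, skipping near-duplicates of ones already kept.

    scored_memories: list of (content, embedding), sorted by relevance.
    """
    selected = []
    for content, emb in scored_memories:
        if all(cosine(emb, kept) < max_similarity for _, kept in selected):
            selected.append((content, emb))
    return [content for content, _ in selected]
```

Because the input is relevance-sorted, the greedy pass always keeps the most relevant member of each near-duplicate cluster.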
```text
# Example: Query "How do we handle database migrations?"

## Relevant Context
- 2026-01-15: Established migration policy: always use
  backwards-compatible migrations, test on staging first
- 2026-01-12: DB migration failed in production due to
  missing index, caused 15min downtime
- 2025-12-08: Team decision to use Alembic for all
  schema changes, no manual SQL

## Current Request
How do we handle database migrations?
```
The agent now answers based on your actual history and established practices, not generic advice.
5. Technique 4: Knowledge Graph Context
Sometimes relevant context isn’t just similar content — it’s connected content. Knowledge graphs surface relationships between memories that pure semantic search misses.
```python
# Build memory relationships
memory.add_relationship(
    from_memory="user_service_optimization",
    to_memory="database_query_patterns",
    relationship="depends_on",
)
memory.add_relationship(
    from_memory="authentication_refactor",
    to_memory="user_service_optimization",
    relationship="blocks",
)

# Retrieve connected context
def get_connected_context(query):
    """Find memories connected to the query topic."""
    # Find directly relevant memories
    direct_memories = memory.search_memories(query, limit=3)

    # Find memories connected to the relevant ones
    connected_memories = []
    for mem in direct_memories:
        connections = memory.get_connected_memories(
            memory_id=mem.id,
            max_depth=2,  # 2 hops maximum
            relationship_types=["depends_on", "relates_to", "blocks"],
        )
        connected_memories.extend(connections)
    return direct_memories + connected_memories
```
Relationship Types
Different relationships provide different types of context:
- depends_on: Prerequisites and dependencies
- blocks: What’s waiting on this decision/task
- relates_to: General topical connections
- conflicts_with: Contradictory decisions or approaches
- replaces: Outdated approaches or decisions
"When someone asks about user service performance, I don’t just want memories about performance. I want memories about the authentication refactor that depends on performance work, the database patterns that affect performance, and the SOC2 requirements that constrain our optimization approach." — Lead Engineer, ChaozCode
6. Technique 5: Temporal Context Windows
Not all context is equally relevant over time. A decision made yesterday is more relevant than one made six months ago — unless it’s a fundamental architectural choice that still applies.
```python
# Time-aware context retrieval
def get_temporal_context(query, window_strategy="adaptive"):
    """Retrieve context with temporal relevance weighting."""
    if window_strategy == "adaptive":
        # Recent context for tactical questions
        if is_tactical_query(query):
            memories = memory.search_memories(
                query=query,
                time_window="last_30_days",
                boost_recent=0.5,
            )
        # Longer context for strategic questions
        else:
            memories = memory.search_memories(
                query=query,
                time_window="last_6_months",
                boost_recent=0.2,
            )
    elif window_strategy == "milestone":
        # Context since the last major milestone
        last_milestone = get_last_milestone()  # Sprint, release, etc.
        memories = memory.search_memories(
            query=query,
            since_timestamp=last_milestone.timestamp,
        )
    return memories
```
Temporal Decay Functions
Different types of memories decay at different rates:
| Memory Type | Decay Rate | Half-Life | Example |
|---|---|---|---|
| Tactical decisions | Fast | 2 weeks | Sprint planning, bug triage |
| Strategic decisions | Slow | 6 months | Architecture choices, team processes |
| User preferences | Very slow | 1 year | Code style, communication style |
| Situational context | Very fast | 3 days | Current debugging session |
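The half-lives in the table map directly onto exponential decay: a memory's score halves once per half-life. A minimal sketch; the function name and the idea of multiplying a base relevance score by the decay factor are assumptions for illustration:

```python
# Half-life decay for the memory types in the table above (days).
HALF_LIVES = {
    "tactical": 14,      # sprint planning, bug triage
    "strategic": 180,    # architecture choices, team processes
    "preference": 365,   # code style, communication style
    "situational": 3,    # current debugging session
}

def decayed_score(base_score: float, memory_type: str, age_days: float) -> float:
    """Exponential decay: the score halves every half-life."""
    half_life = HALF_LIVES[memory_type]
    return base_score * 0.5 ** (age_days / half_life)
```

So a two-week-old tactical decision retains half its weight, while a two-week-old situational note has decayed to roughly 4% and effectively drops out of retrieval.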
7. Token Budget Allocation Strategy
Context injection consumes tokens, and context windows aren’t infinite. You need a strategy for allocating your token budget across different types of context.
Based on 10,000+ memory-augmented conversations:
- System context: 20% (pinned memories, agent personality)
- Memory context: 30% (relevant retrieved memories)
- User query: 50% (current request and immediate context)
```python
# Token budget management
class ContextBudget:
    def __init__(self, total_tokens=100000):
        self.total_tokens = total_tokens
        self.allocations = {
            "system": int(total_tokens * 0.20),
            "memory": int(total_tokens * 0.30),
            "user": int(total_tokens * 0.50),
        }

    def build_context(self, query):
        context_parts = []
        remaining_budget = self.allocations.copy()

        # 1. Add pinned system context (highest priority)
        system_context = memory.get_pinned_context()
        system_tokens = estimate_tokens(system_context)
        if system_tokens <= remaining_budget["system"]:
            context_parts.append(system_context)
            remaining_budget["system"] -= system_tokens

        # 2. Add relevant memories (within budget)
        memories = memory.search_memories(query, limit=10)
        memory_context = ""
        for mem in memories:
            mem_tokens = estimate_tokens(mem.content)
            if mem_tokens <= remaining_budget["memory"]:
                memory_context += f"{mem.content}\n"
                remaining_budget["memory"] -= mem_tokens
            else:
                break
        if memory_context:
            context_parts.append(memory_context)

        # 3. Add user query (guaranteed space)
        context_parts.append(query)
        return "\n\n".join(context_parts)
```
Dynamic Budget Adjustment
Adjust allocations based on query complexity:
- Simple queries: More budget for memory context (40%)
- Complex queries: More budget for user query (60%)
- New users: More budget for system context (30%)
- Established users: Less system context, more memory (40%)
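These adjustment rules can be encoded as a small function. The sketch below follows the percentages in the list and renormalizes so the shares still sum to 1.0; the function name, the string-flag interface, and the renormalization step are all assumptions, not part of Memory Spine:

```python
def adjust_allocations(query_complexity: str, user_tenure: str) -> dict:
    """Adjust token-budget shares per the rules above.

    query_complexity: "simple" or "complex"
    user_tenure: "new" or "established"
    """
    # Baseline split: system 20% / memory 30% / user 50%
    alloc = {"system": 0.20, "memory": 0.30, "user": 0.50}
    if query_complexity == "simple":
        alloc["memory"] = 0.40   # simple queries: more memory context
    else:
        alloc["user"] = 0.60     # complex queries: more room for the query
    if user_tenure == "new":
        alloc["system"] = 0.30   # new users: more system context
    # Renormalize so the shares sum to 1.0 again
    total = sum(alloc.values())
    return {k: round(v / total, 3) for k, v in alloc.items()}
```

Renormalizing keeps the budget exhaustive; the alternative is to treat one bucket (usually the user query) as the flexible remainder.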
Before/After Quality Comparison
We measured response quality across 1,000 queries, comparing naive prompting vs. memory-augmented prompting:
| Metric | Baseline | Memory-Augmented | Improvement |
|---|---|---|---|
| Relevance Score | 6.2/10 | 8.7/10 | +40% |
| Accuracy | 71% | 89% | +25% |
| Consistency | 5.8/10 | 8.4/10 | +45% |
| User Satisfaction | 7.1/10 | 9.2/10 | +30% |
The results are clear: memory context isn't just helpful, it's transformative. Agents go from providing generic advice to giving specific, personalized, contextually aware guidance that feels like working with a knowledgeable colleague.
Start Building Memory-Augmented Prompts
Try these techniques with Memory Spine. Free tier includes everything you need to experiment with context injection.
Get Started Free →