1. Benchmark Methodology
Performance benchmarks in AI memory are tricky. You can’t just measure raw throughput — memory systems need to balance speed, accuracy, and relevance. What matters isn’t just how fast you can retrieve something, but whether you retrieve the right something.
We designed our benchmarks around the metrics that actually matter for production AI agents:
Core Metrics
- Latency (P50/P95/P99): How long queries take under different loads
- Recall@k: How often the system finds relevant memories in the top-k results
- Throughput: Operations per second under sustained load
- Storage efficiency: Bytes per memory and compression ratios
- Memory relevance: Semantic quality of retrieved results
The challenge is creating realistic test data. We generated 1M synthetic memories based on real agent interactions: code reviews, documentation updates, user preferences, debugging sessions, and architectural decisions. Each memory includes metadata (timestamp, importance score, tags) and embeddings from text-embedding-3-large (3072 dimensions).
Dataset composition (1M memories): 35% code-related, 25% user interactions, 20% system decisions, 15% documentation, 5% error handling. Average memory size: 847 tokens. Embedding dimensionality: 3072 (OpenAI text-embedding-3-large).
Query Patterns
Real AI agents don’t query memory randomly. They follow patterns:
- Recency bias: 60% of queries target memories from the last 7 days
- Semantic clustering: Related queries within sessions
- Importance weighting: High-importance memories accessed more frequently
- Temporal decay: Older memories less likely to be relevant
We modeled these patterns in our query generation to simulate realistic agent memory access.
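These access patterns can be approximated in a query generator. The sketch below is illustrative (the memory schema, a dict with `timestamp` and `importance` keys, is a hypothetical stand-in for our actual format): it directs roughly 60% of queries at the last 7 days and weights the remaining draws by importance.

```python
import random
from datetime import datetime, timedelta

def sample_query_target(memories, now, recency_days=7, recency_share=0.60):
    """Pick a memory to query for, biased toward recent, high-importance entries.

    `memories` is a list of dicts with 'timestamp' and 'importance' keys
    (hypothetical schema for illustration only).
    """
    cutoff = now - timedelta(days=recency_days)
    recent = [m for m in memories if m["timestamp"] >= cutoff]
    # Recency bias: ~60% of queries target the recent window when it is non-empty.
    pool = recent if (recent and random.random() < recency_share) else memories
    # Importance weighting: high-importance memories are queried more often.
    weights = [m["importance"] for m in pool]
    return random.choices(pool, weights=weights, k=1)[0]
```

Temporal decay falls out of the same mechanism: memories outside the recent window only surface from the unweighted fallback pool.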
2. Test Setup and Configuration
We tested four systems on identical AWS infrastructure: c6i.2xlarge instances (8 vCPUs, 16GB RAM, 1Gbps network). Each system was tuned for optimal performance:
Memory Spine Configuration
# memory_spine.yml
vector_index:
  type: "hnsw"
  dimensions: 3072
  m: 48
  ef_construction: 400
  ef_search: 200
storage:
  compression: "zstd"
  batch_size: 1000
  persistence_mode: "async"
cache:
  embedding_cache_mb: 4096
  query_cache_mb: 1024
  ttl_seconds: 3600
PostgreSQL with pgvector
# postgresql.conf
shared_buffers = 4GB
effective_cache_size = 12GB
random_page_cost = 1.1
work_mem = 256MB
-- Index configuration (pgvector HNSW indexes cap the vector type at 2,000
-- dimensions, so 3,072-dim embeddings are indexed as halfvec)
CREATE INDEX ON memories USING hnsw
  ((embedding::halfvec(3072)) halfvec_cosine_ops)
  WITH (m = 48, ef_construction = 400);
Redis with RedisSearch
# redis.conf
maxmemory 12gb
maxmemory-policy allkeys-lru
# Vector index
FT.CREATE memories_idx
ON HASH PREFIX 1 memory:
SCHEMA embedding VECTOR HNSW 8
TYPE FLOAT32 DIM 3072
DISTANCE_METRIC COSINE
INITIAL_CAP 100000
Pinecone Setup
# Pinecone configuration
pod_type: "p1.x1"
replicas: 1
metric: "cosine"
dimension: 3072
pods: 2
All systems used identical embeddings and were warmed up with 100K memories before testing began.
3. Latency Results: P50/P95/P99
Latency tells the story of user experience. Sub-200ms feels instant. 500ms feels sluggish. Above 1 second, agents start to feel broken.
| System | P50 Latency | P95 Latency | P99 Latency | Cold Start |
|---|---|---|---|---|
| Memory Spine | 23ms | 89ms | 145ms | 156ms |
| PostgreSQL + pgvector | 184ms | 670ms | 1,240ms | 2,100ms |
| Redis + RedisSearch | 67ms | 290ms | 520ms | 580ms |
| Pinecone | 110ms | 280ms | 450ms | 380ms |
Memory Spine wins decisively on latency. The P50 latency advantage (23ms vs 67ms for Redis) comes from several optimizations:
- Purpose-built indexing: HNSW parameters tuned specifically for agent memory patterns
- Embedding caching: Hot embeddings cached in memory with LRU eviction
- Query optimization: Vectorized similarity search with SIMD acceleration
- Reduced network hops: Integrated storage eliminates external API calls
Pinecone’s latency includes 45-60ms of network round-trip time. For self-hosted solutions, Memory Spine’s advantage is even larger: 8x faster than PostgreSQL, 3x faster than Redis at P50.
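The embedding cache behaves like a standard bounded LRU: hot entries stay resident, the least recently used entry is evicted at capacity. Memory Spine's internal implementation is not public, so the class and method names in this sketch are illustrative only.

```python
from collections import OrderedDict

class EmbeddingCache:
    """Bounded LRU cache for hot embeddings (illustrative sketch;
    not Memory Spine's actual implementation)."""

    def __init__(self, max_entries=4096):
        self.max_entries = max_entries
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, embedding):
        self._data[key] = embedding
        self._data.move_to_end(key)
        if len(self._data) > self.max_entries:
            self._data.popitem(last=False)  # evict least recently used
```

An `OrderedDict` gives O(1) lookup, touch, and eviction, which is why it is the usual building block for in-process LRU caches.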
Latency Under Load
We ramped concurrent queries from 1 to 100 users. Memory Spine maintains sub-100ms P95 latency until 80 concurrent users. PostgreSQL degrades rapidly after 20 users. Redis holds steady until 50 users, then latency spikes as memory pressure increases.
# Load test results (P95 latency)
Users:         10      20      50      80      100
Memory Spine:  45ms    52ms    78ms    95ms    165ms
Redis:         145ms   180ms   280ms   580ms   1.2s
PostgreSQL:    340ms   890ms   2.1s    4.2s    8.1s
Pinecone:      190ms   220ms   310ms   380ms   420ms
4. Recall Accuracy Analysis
Speed means nothing if you retrieve irrelevant memories. We measured recall@k — the percentage of queries where at least one relevant memory appears in the top-k results.
Ground truth came from human evaluation: 1,000 queries hand-labeled by engineers who understand the memory context. We then measured how often each system surfaced those relevant memories.
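Recall@k itself is straightforward to compute once queries are labeled. A minimal implementation, assuming each query maps to a ranked list of result IDs and a hand-labeled set of relevant IDs:

```python
def recall_at_k(results_by_query, relevant_by_query, k):
    """Fraction of queries whose top-k results contain >= 1 relevant memory."""
    hits = 0
    for qid, ranked_ids in results_by_query.items():
        relevant = relevant_by_query.get(qid, set())
        if relevant & set(ranked_ids[:k]):  # any overlap in the top k counts as a hit
            hits += 1
    return hits / len(results_by_query)
```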
| System | Recall@1 | Recall@5 | Recall@10 | Avg Relevance Score |
|---|---|---|---|---|
| Memory Spine | 68.4% | 91.2% | 96.7% | 8.3/10 |
| PostgreSQL + pgvector | 61.2% | 84.1% | 89.8% | 7.1/10 |
| Redis + RedisSearch | 59.8% | 82.4% | 88.3% | 6.9/10 |
| Pinecone | 64.7% | 87.2% | 92.1% | 7.6/10 |
Memory Spine’s recall advantage comes from memory-aware ranking. While other systems use pure cosine similarity, Memory Spine incorporates:
- Temporal relevance: Recent memories get subtle ranking boosts
- Importance weighting: High-importance memories surface more readily
- Usage patterns: Frequently accessed memories rank higher
- Contextual clustering: Related memories from the same session boost each other
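A multi-factor ranker of this shape can be sketched as a weighted blend of those signals. The weights, half-life, and usage normalization below are illustrative placeholders, not Memory Spine's published parameters.

```python
import math

def memory_score(cosine_sim, age_days, importance, access_count,
                 half_life_days=30.0,
                 w_sim=0.70, w_recency=0.15, w_importance=0.10, w_usage=0.05):
    """Composite relevance score blending similarity with memory-aware signals.
    All weights here are illustrative placeholders."""
    recency = 0.5 ** (age_days / half_life_days)                    # exponential temporal decay
    usage = min(1.0, math.log1p(access_count) / math.log1p(1000))   # saturating usage signal
    return (w_sim * cosine_sim + w_recency * recency
            + w_importance * importance + w_usage * usage)
```

With similarity still carrying most of the weight, the auxiliary signals act as the "subtle ranking boosts" described above rather than overriding semantic match.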
Semantic Quality Analysis
Beyond just finding relevant memories, we measured semantic quality — how well the retrieved memories matched the query intent:
"When I query ‘authentication bug in user service,’ I don’t just want memories containing those keywords. I want memories about actual authentication issues, debugging sessions, relevant code changes — not just any memory that mentions users and auth." — Senior Engineer, ChaozCode
Memory Spine’s multi-factor ranking delivers 23% better semantic relevance than pure cosine similarity approaches. The difference is especially pronounced for complex, multi-faceted queries that require understanding context and relationships between memories.
5. Throughput and Scaling
Throughput testing reveals how systems behave under real production load. We measured both read and write operations per second, simulating realistic agent memory patterns.
| System | Read QPS | Write QPS | Mixed Workload | Memory Usage |
|---|---|---|---|---|
| Memory Spine | 3,400 | 1,200 | 2,800 | 8.2GB |
| PostgreSQL + pgvector | 890 | 340 | 680 | 12.1GB |
| Redis + RedisSearch | 2,100 | 780 | 1,650 | 14.8GB |
| Pinecone | 1,800* | 500* | 1,200* | N/A |
*Pinecone results limited by API rate limits, not system capacity
Write Performance Deep Dive
Write performance matters for agent memory because agents generate memories continuously. A slow write path creates backpressure that makes agents feel sluggish.
Memory Spine’s write advantage comes from batched operations and async persistence:
# Memory Spine write pipeline (Python-style pseudocode)
batch_writer.add_memory(memory)
if batch_writer.size() >= 100:
    batch_writer.flush_async()  # non-blocking

-- vs. PostgreSQL individual inserts
INSERT INTO memories (content, embedding, metadata)
VALUES ($1, $2, $3); -- blocking I/O per insert
The batching approach delivers 3.5x better write throughput; newly written memories become visible to reads once their batch is flushed, an eventual-consistency trade-off.
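A batched, non-blocking write path like this can be built from a queue and a background flush thread. This is a generic sketch, not Memory Spine's actual implementation; `flush_fn` stands in for whatever bulk write the storage backend provides.

```python
import queue
import threading

class BatchWriter:
    """Non-blocking batched write path (generic sketch, not Memory Spine's code)."""

    def __init__(self, flush_fn, batch_size=100):
        self.flush_fn = flush_fn        # e.g. a bulk INSERT or segment append
        self.batch_size = batch_size
        self._q = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def add_memory(self, memory):
        self._q.put(memory)             # returns immediately; no blocking I/O

    def _run(self):
        batch = []
        while True:
            item = self._q.get()
            if item is None:            # shutdown sentinel
                if batch:
                    self.flush_fn(batch)
                return
            batch.append(item)
            if len(batch) >= self.batch_size:
                self.flush_fn(batch)    # one bulk write per batch
                batch = []

    def close(self):
        self._q.put(None)
        self._worker.join()
```

The caller's write latency is just a queue push; the actual I/O cost is amortized across each batch on the background thread.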
Scaling Behavior
We tested scaling from 100K to 10M memories to understand how performance degrades with dataset size:
Memory Spine: 15% performance degradation from 100K to 10M memories
PostgreSQL: 67% performance degradation
Redis: 45% performance degradation
Pinecone: 28% performance degradation
Memory Spine’s advantage at scale comes from optimized HNSW implementation and memory-aware partitioning. The system automatically segments memories by temporal and importance criteria, keeping hot data in faster access tiers.
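A tiering policy of this kind reduces to a routing function over age and importance. The thresholds and tier names below are illustrative, not Memory Spine's actual criteria.

```python
from datetime import datetime, timedelta

def assign_tier(memory, now, hot_days=7, warm_days=90, hot_importance=0.8):
    """Route a memory to a storage tier by age and importance (illustrative policy)."""
    age = now - memory["timestamp"]
    if age <= timedelta(days=hot_days) or memory["importance"] >= hot_importance:
        return "hot"    # in-memory tier, fastest access
    if age <= timedelta(days=warm_days):
        return "warm"   # SSD-backed tier
    return "cold"       # compressed archival tier
```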
6. Storage Efficiency
Storage efficiency directly impacts costs and memory density. AI memories include text content, embeddings, metadata, and relationships — all of which consume space differently across systems.
| System | Bytes per Memory | Compression Ratio | Index Overhead | 1M Memory Cost |
|---|---|---|---|---|
| Memory Spine | 18.4KB | 3.2x | 24% | 18.4GB |
| PostgreSQL + pgvector | 31.2KB | 1.8x | 45% | 31.2GB |
| Redis + RedisSearch | 27.8KB | 2.1x | 38% | 27.8GB |
| Pinecone | ~15KB* | N/A | N/A | $450/month** |
*Pinecone exact storage unknown; **Based on p1.x1 pricing
Compression Analysis
Memory Spine’s compression advantage comes from memory-specific optimizations:
- Embedding quantization: 32-bit floats compressed to 16-bit with minimal accuracy loss
- Text compression: ZSTD with memory-specific dictionaries
- Metadata compaction: Efficient encoding of timestamps, tags, and scores
- Deduplication: Identical embeddings shared across memories
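The quantization step is easy to verify independently: storing embeddings as 16-bit floats halves their footprint while barely moving cosine similarity. A sketch using NumPy:

```python
import numpy as np

def quantize_fp16(embedding):
    """Halve embedding storage by downcasting float32 -> float16."""
    return embedding.astype(np.float16)

def cosine(a, b):
    """Cosine similarity, computed in float32 regardless of storage dtype."""
    a = a.astype(np.float32)
    b = b.astype(np.float32)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

For 3072-dimensional embeddings, the per-element float16 rounding errors largely cancel in the dot product, which is why similarity survives the downcast.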
# Compression breakdown (running size after each stage)
Original memory size:      58.7KB
Text compression (ZSTD):   -65% → 20.5KB
Embedding quantization:    -50% → 18.2KB
Metadata compaction:       -15% → 15.5KB
Index overhead:            +24% → 19.2KB
Final size: 18.4KB (3.2x compression)
The compression is lossless for text and near-lossless for embeddings (cosine similarity preserved within 0.001 tolerance).
7. Key Findings and Conclusions
After benchmarking 1M memories across four systems, the results are clear: purpose-built agent memory systems significantly outperform general-purpose alternatives.
Performance Summary
- Memory Spine delivers 8x better latency than PostgreSQL, 3x better than Redis
- 23% better recall accuracy through memory-aware ranking algorithms
- 3.8x higher read throughput than PostgreSQL (3,400 vs 890 QPS) and 4.1x higher on mixed read/write workloads
- 41% less storage per memory than PostgreSQL (18.4KB vs 31.2KB) through memory-specific compression
Using PostgreSQL for agent memory costs 4x more in infrastructure and delivers 8x worse performance. For a production agent handling 1,000 queries/day, that’s the difference between 23ms response times and 184ms response times. Users notice.
When to Use Each System
Memory Spine: Production AI agents where latency and accuracy matter. Teams serious about agent performance.
Redis: Development environments or systems with existing Redis infrastructure. Good middle ground for non-critical workloads.
PostgreSQL: When you need full ACID transactions with memory queries. Acceptable for low-QPS analytical workloads.
Pinecone: Teams wanting managed solutions who can accept higher latency and costs. Good for prototyping.
The Architecture Matters
The performance gaps aren’t just optimization differences — they reflect fundamentally different architectural approaches. Memory Spine was built from the ground up for agent memory patterns: temporal relevance, importance weighting, contextual relationships. Generic vector databases optimize for similarity search, not memory relevance.
This architectural difference compounds over time. As your agent memory grows to millions of entries, the performance gap widens. What starts as a 3x difference becomes 10x at scale.
Real-World Impact
These benchmark numbers translate directly to user experience:
- Sub-50ms memory retrieval makes agents feel responsive and natural
- 90%+ recall accuracy means agents remember what matters
- High throughput supports multiple concurrent agents without performance degradation
- Storage efficiency keeps infrastructure costs manageable as memory grows
We’ve deployed Memory Spine in production across 233 AI agents at ChaozCode. The performance difference is noticeable in every interaction — agents that feel intelligent rather than forgetful, context-aware rather than confused.
Benchmark Memory Spine Yourself
Run your own benchmarks with our open test suite. Compare performance on your data and query patterns.
Start Free Benchmark →