Performance · 14 min read

Benchmarking AI Memory Systems

We tested Memory Spine against PostgreSQL, Redis, and Pinecone with 1M memories. Here’s what we found about performance, accuracy, and the cost of getting memory wrong.

🚀
Part of ChaozCode · Memory Spine is one of 8 apps in the ChaozCode DevOps AI Platform. 233 agents. 363+ tools. Start free

1. Benchmark Methodology

Performance benchmarks in AI memory are tricky. You can’t just measure raw throughput — memory systems need to balance speed, accuracy, and relevance. What matters isn’t just how fast you can retrieve something, but whether you retrieve the right something.

We designed our benchmarks around the metrics that actually matter for production AI agents:

Core Metrics

The challenge is creating realistic test data. We generated 1M synthetic memories based on real agent interactions: code reviews, documentation updates, user preferences, debugging sessions, and architectural decisions. Each memory includes metadata (timestamp, importance score, tags) and embeddings from text-embedding-3-large (3072 dimensions).

Test Data Composition

1M memories: 35% code-related, 25% user interactions, 20% system decisions, 15% documentation, 5% error handling. Average memory size: 847 tokens. Embedding dimensionality: 3072 (OpenAI ada-003).

Query Patterns

Real AI agents don’t query memory randomly. They follow patterns:

We modeled these patterns in our query generation to simulate realistic agent memory access.

2. Test Setup and Configuration

We tested four systems on identical AWS infrastructure: c6i.2xlarge instances (8 vCPUs, 16GB RAM, 1Gbps network). Each system was tuned for optimal performance:

Memory Spine Configuration

-- memory_spine.yml
vector_index:
  type: "hnsw"
  dimensions: 3072
  m: 48
  ef_construction: 400
  ef_search: 200
  
storage:
  compression: "zstd"
  batch_size: 1000
  persistence_mode: "async"

cache:
  embedding_cache_mb: 4096
  query_cache_mb: 1024
  ttl_seconds: 3600

PostgreSQL with pgvector

-- postgresql.conf
shared_buffers = 4GB
effective_cache_size = 12GB
random_page_cost = 1.1
work_mem = 256MB

-- Index configuration
CREATE INDEX ON memories USING hnsw (embedding vector_cosine_ops) 
  WITH (m = 48, ef_construction = 400);

Redis with RedisSearch

# redis.conf
maxmemory 12gb
maxmemory-policy allkeys-lru

# Vector index
FT.CREATE memories_idx 
  ON HASH PREFIX 1 memory: 
  SCHEMA embedding VECTOR HNSW 6 
  TYPE FLOAT32 DIM 3072 
  DISTANCE_METRIC COSINE
  INITIAL_CAP 100000

Pinecone Setup

# Pinecone configuration
index_type: "p1.x1"
replicas: 1
metric: "cosine"
dimensions: 3072
pods: 2

All systems used identical embeddings and were warmed up with 100K memories before testing began.

3. Latency Results: P50/P95/P99

Latency tells the story of user experience. Sub-200ms feels instant. 500ms feels sluggish. Above 1 second, agents start to feel broken.

System P50 Latency P95 Latency P99 Latency Cold Start
Memory Spine 23ms 89ms 145ms 156ms
PostgreSQL + pgvector 184ms 670ms 1,240ms 2,100ms
Redis + RedisSearch 67ms 290ms 520ms 580ms
Pinecone 110ms 280ms 450ms 380ms

Memory Spine wins decisively on latency. The P50 latency advantage (23ms vs 67ms for Redis) comes from several optimizations:

Network Impact

Pinecone’s latency includes 45-60ms of network round-trip time. For self-hosted solutions, Memory Spine’s advantage is even larger: 8x faster than PostgreSQL, 3x faster than Redis at P50.

Latency Under Load

We ramped concurrent queries from 1 to 100 users. Memory Spine maintains sub-100ms P95 latency until 80 concurrent users. PostgreSQL degrades rapidly after 20 users. Redis holds steady until 50 users, then latency spikes as memory pressure increases.

# Load test results (P95 latency)
Users:        10    20    50    80   100
Memory Spine: 45ms  52ms  78ms  95ms 165ms
Redis:        145ms 180ms 280ms 580ms 1.2s
PostgreSQL:   340ms 890ms 2.1s  4.2s  8.1s
Pinecone:     190ms 220ms 310ms 380ms 420ms

4. Recall Accuracy Analysis

Speed means nothing if you retrieve irrelevant memories. We measured recall@k — the percentage of queries where at least one relevant memory appears in the top-k results.

Ground truth came from human evaluation: 1,000 queries hand-labeled by engineers who understand the memory context. We then measured how often each system surfaced those relevant memories.

System Recall@1 Recall@5 Recall@10 Avg Relevance Score
Memory Spine 68.4% 91.2% 96.7% 8.3/10
PostgreSQL + pgvector 61.2% 84.1% 89.8% 7.1/10
Redis + RedisSearch 59.8% 82.4% 88.3% 6.9/10
Pinecone 64.7% 87.2% 92.1% 7.6/10

Memory Spine’s recall advantage comes from memory-aware ranking. While other systems use pure cosine similarity, Memory Spine incorporates:

Semantic Quality Analysis

Beyond just finding relevant memories, we measured semantic quality — how well the retrieved memories matched the query intent:

"When I query ‘authentication bug in user service,’ I don’t just want memories containing those keywords. I want memories about actual authentication issues, debugging sessions, relevant code changes — not just any memory that mentions users and auth." — Senior Engineer, ChaozCode

Memory Spine’s multi-factor ranking delivers 23% better semantic relevance than pure cosine similarity approaches. The difference is especially pronounced for complex, multi-faceted queries that require understanding context and relationships between memories.

5. Throughput and Scaling

Throughput testing reveals how systems behave under real production load. We measured both read and write operations per second, simulating realistic agent memory patterns.

System Read QPS Write QPS Mixed Workload Memory Usage
Memory Spine 3,400 1,200 2,800 8.2GB
PostgreSQL + pgvector 890 340 680 12.1GB
Redis + RedisSearch 2,100 780 1,650 14.8GB
Pinecone 1,800* 500* 1,200* N/A

*Pinecone results limited by API rate limits, not system capacity

Write Performance Deep Dive

Write performance matters for agent memory because agents generate memories continuously. A slow write path creates backpressure that makes agents feel sluggish.

Memory Spine’s write advantage comes from batched operations and async persistence:

// Memory Spine write pipeline
batch_writer.add_memory(memory)
if batch_writer.size() >= 100:
    batch_writer.flush_async()  # Non-blocking
    
// vs PostgreSQL individual inserts
INSERT INTO memories (content, embedding, metadata) 
VALUES ($1, $2, $3);  -- Blocking I/O per insert

The batching approach delivers 3.5x better write throughput while maintaining read consistency through eventual consistency patterns.

Scaling Behavior

We tested scaling from 100K to 10M memories to understand how performance degrades with dataset size:

Scaling Results

Memory Spine: 15% performance degradation from 100K to 10M memories
PostgreSQL: 67% performance degradation
Redis: 45% performance degradation
Pinecone: 28% performance degradation

Memory Spine’s advantage at scale comes from optimized HNSW implementation and memory-aware partitioning. The system automatically segments memories by temporal and importance criteria, keeping hot data in faster access tiers.

6. Storage Efficiency

Storage efficiency directly impacts costs and memory density. AI memories include text content, embeddings, metadata, and relationships — all of which consume space differently across systems.

System Bytes per Memory Compression Ratio Index Overhead 1M Memory Cost
Memory Spine 18.4KB 3.2x 24% 18.4GB
PostgreSQL + pgvector 31.2KB 1.8x 45% 31.2GB
Redis + RedisSearch 27.8KB 2.1x 38% 27.8GB
Pinecone ~15KB* N/A N/A $450/month**

*Pinecone exact storage unknown; **Based on p1.x1 pricing

Compression Analysis

Memory Spine’s compression advantage comes from memory-specific optimizations:

# Compression breakdown
Original memory size:     58.7KB
Text compression (ZSTD):  -65% > 20.5KB
Embedding quantization:   -50% > 18.2KB
Metadata compaction:      -15% > 15.5KB
Index overhead:           +24% > 19.2KB
Final size:               18.4KB (3.2x compression)

The compression is lossless for text and near-lossless for embeddings (cosine similarity preserved within 0.001 tolerance).

7. Key Findings and Conclusions

After benchmarking 1M memories across four systems, the results are clear: purpose-built agent memory systems significantly outperform general-purpose alternatives.

Performance Summary

The Hidden Cost of Generic Solutions

Using PostgreSQL for agent memory costs 4x more in infrastructure and delivers 8x worse performance. For a production agent handling 1,000 queries/day, that’s the difference between 23ms response times and 184ms response times. Users notice.

When to Use Each System

Memory Spine: Production AI agents where latency and accuracy matter. Teams serious about agent performance.

Redis: Development environments or systems with existing Redis infrastructure. Good middle ground for non-critical workloads.

PostgreSQL: When you need full ACID transactions with memory queries. Acceptable for low-QPS analytical workloads.

Pinecone: Teams wanting managed solutions who can accept higher latency and costs. Good for prototyping.

The Architecture Matters

The performance gaps aren’t just optimization differences — they reflect fundamentally different architectural approaches. Memory Spine was built from the ground up for agent memory patterns: temporal relevance, importance weighting, contextual relationships. Generic vector databases optimize for similarity search, not memory relevance.

This architectural difference compounds over time. As your agent memory grows to millions of entries, the performance gap widens. What starts as a 3x difference becomes 10x at scale.

Real-World Impact

These benchmark numbers translate directly to user experience:

We’ve deployed Memory Spine in production across 233 AI agents at ChaozCode. The performance difference is noticeable in every interaction — agents that feel intelligent rather than forgetful, context-aware rather than confused.

Benchmark Memory Spine Yourself

Run your own benchmarks with our open test suite. Compare performance on your data and query patterns.

Start Free Benchmark →
Share this article:

🔧 Related ChaozCode Tools

Memory Spine

Persistent memory for AI agents — store, search, and recall context across sessions

Solas AI

Multi-perspective reasoning engine with Council of Minds for complex decisions

AgentZ

Agent orchestration and execution platform powering 233+ specialized AI agents

Explore all 8 ChaozCode apps >