Architecture · 11 min read

RAG in Production: Beyond the Demo with Memory Spine

Every RAG tutorial shows you how to split documents and query a vector store. None of them show you what happens when your retrieval pipeline processes 50,000 queries per day, relevance degrades silently, and your chunking strategy fails on real-world documents. This is the production guide.

🚀 Part of ChaozCode · Memory Spine is one of 8 apps in the ChaozCode DevOps AI Platform. 233 agents. 363+ tools. Start free

1. The Demo-to-Production Gap

Building a RAG demo takes an afternoon. You split a PDF into chunks, embed them with OpenAI, store the vectors in a database, and wire up a retrieval query. It works beautifully on your test documents.

Then you deploy it to production and everything falls apart.

The first problem: your fixed-size chunking splits a critical paragraph in half, and the retriever returns the wrong half. The second problem: a user asks a question using terminology your documents don't use, and semantic search returns irrelevant results. The third problem: you have no idea any of this is happening because you're not measuring retrieval quality.

The gap between a RAG demo and a production RAG system is enormous. It's not about the technology — it's about all the operational concerns that tutorials never mention.

The Silent Failure Mode

RAG systems fail silently. Unlike a crash or timeout, bad retrieval produces a plausible-sounding but incorrect answer. Your users won't report it as a bug — they'll just lose trust in the system.

This guide covers the patterns and practices we've developed running retrieval-augmented generation at production scale with Memory Spine — where silent failures are not an option.

2. RAG Architecture Evolution

RAG has evolved through three generations. Understanding where your system sits helps you identify what to improve next.

Generation 1: Naive RAG

Split documents into fixed-size chunks. Embed each chunk. On query, embed the query, find the top-k nearest chunks, stuff them into the LLM context. Simple, and surprisingly effective for demos.

Failure modes: Context boundary problems. Irrelevant retrieval on out-of-vocabulary queries. No way to handle multi-hop reasoning ("What changed between version 2 and version 3?").
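For reference, the entire Gen 1 pipeline fits in a few lines. The sketch below uses placeholder embed, vector-store, and LLM interfaces rather than any specific library:

# Minimal Gen 1 RAG: embed the query, fetch nearest chunks, stuff them into the prompt.
def naive_rag(query: str, embed, vector_store, llm, top_k: int = 5) -> str:
    query_vector = embed(query)
    chunks = vector_store.search(query_vector, top_k=top_k)
    context = "\n\n".join(chunk.text for chunk in chunks)
    return llm.generate(
        f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
    )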

Generation 2: Advanced RAG

Better chunking (semantic, hierarchical). Query transformation (rewriting, decomposition, HyDE). Re-ranking retrieved results before sending to the LLM. Metadata filtering to narrow search scope.

Failure modes: Increased latency from the re-ranking step. Query transformation can distort intent. Still limited to single-pass retrieval.
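One common way to implement the re-ranking step is a cross-encoder that scores each (query, chunk) pair jointly, for example via the sentence-transformers CrossEncoder (the model name below is illustrative):

from sentence_transformers import CrossEncoder

# Cross-encoder re-ranking: slower than bi-encoder retrieval (this is where the
# extra latency comes from) but much better at fine-grained relevance.
reranker_model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, chunks: list, top_k: int = 5) -> list:
    scores = reranker_model.predict([(query, chunk.text) for chunk in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]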

Generation 3: Modular RAG

Retrieval becomes a multi-step pipeline with routing, iterative refinement, and adaptive strategies. The system can decide whether to retrieve at all, which retrieval method to use, and when to stop iterating.

class ModularRAGPipeline:
    """Generation 3: Adaptive retrieval pipeline"""

    def __init__(self, retriever, reranker, router, memory):
        self.retriever = retriever
        self.reranker = reranker
        self.router = router
        self.memory = memory

    async def answer(self, query: str) -> RAGResponse:
        # Step 1: Route — does this query need retrieval at all?
        route = await self.router.classify(query)
        if route == "direct":
            return await self.llm_direct(query)

        # Step 2: Transform query for better retrieval
        expanded_queries = await self.query_transform(query)

        # Step 3: Multi-strategy retrieval
        candidates = []
        for q in expanded_queries:
            candidates += await self.retriever.semantic_search(q, top_k=10)
            candidates += await self.retriever.keyword_search(q, top_k=5)

        # Step 4: Deduplicate and re-rank
        unique = self.deduplicate(candidates)
        ranked = await self.reranker.rank(query, unique, top_k=5)

        # Step 5: Check sufficiency
        if not self.is_sufficient(ranked, threshold=0.6):
            ranked += await self.iterative_retrieval(query, ranked)

        # Step 6: Generate with retrieved context
        response = await self.generate(query, ranked)

        # Step 7: Store interaction for learning
        await self.memory.store(query=query, response=response, sources=ranked)

        return response

Generation Impact

Moving from Gen 1 to Gen 3 RAG improved our answer accuracy from 72% to 94% on internal benchmarks. The biggest single improvement came from hybrid retrieval (semantic + keyword), which alone lifted accuracy by 11 percentage points.

3. Chunking Strategies That Actually Work

Chunking is where most RAG systems silently degrade. The wrong chunking strategy means even perfect retrieval returns incomplete or misleading context.

Fixed-Size Chunking (and Why It's Not Enough)

Splitting text every 512 tokens with 50-token overlap is the default in every tutorial. It's fast and simple. It also splits paragraphs mid-sentence, separates code from its explanation, and breaks tables in half.
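For comparison, the tutorial default looks roughly like this (a sketch that assumes the text has already been tokenized into a list of tokens):

def fixed_size_chunks(tokens: list[str], chunk_size: int = 512, overlap: int = 50) -> list[str]:
    # Slide a window of chunk_size tokens, overlapping each chunk with the previous one.
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
    return chunks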

Semantic Chunking

Split at natural boundaries: paragraphs, sections, function definitions. Respects the document's own structure.

class SemanticChunker:
    """Split documents at natural semantic boundaries"""

    def __init__(self, embedding_model, similarity_threshold=0.5):
        self.model = embedding_model
        self.threshold = similarity_threshold

    def chunk(self, text: str, max_tokens: int = 512) -> List[Chunk]:
        sentences = self.split_sentences(text)
        if not sentences:
            return []
        embeddings = self.model.encode(sentences)
        chunks = []
        current_chunk = [sentences[0]]

        for i in range(1, len(sentences)):
            similarity = cosine_similarity(embeddings[i-1], embeddings[i])

            if similarity < self.threshold or self.token_count(current_chunk) > max_tokens:
                chunks.append(Chunk(text=" ".join(current_chunk)))
                current_chunk = [sentences[i]]
            else:
                current_chunk.append(sentences[i])

        if current_chunk:
            chunks.append(Chunk(text=" ".join(current_chunk)))

        return chunks

Hierarchical Chunking

Create chunks at multiple granularity levels — document, section, paragraph — and link them in a tree. When a paragraph-level chunk is retrieved, you can also pull in the parent section for broader context.

class HierarchicalChunker:
    """Multi-level chunking with parent-child relationships"""

    def chunk(self, document: Document) -> ChunkTree:
        tree = ChunkTree(root=Chunk(text=document.summary, level="document"))

        for section in document.sections:
            section_chunk = Chunk(
                text=section.text,
                level="section",
                metadata={"heading": section.heading}
            )
            tree.add_child(tree.root, section_chunk)

            for paragraph in section.paragraphs:
                para_chunk = Chunk(
                    text=paragraph.text,
                    level="paragraph",
                    metadata={"section": section.heading}
                )
                tree.add_child(section_chunk, para_chunk)

        return tree

Strategy     | Pros               | Cons                     | Best For
Fixed-size   | Simple, fast       | Breaks context           | Homogeneous text
Semantic     | Natural boundaries | Slower, model-dependent  | Prose documents
Hierarchical | Multi-granularity  | Complex indexing         | Structured docs
Code-aware   | Respects AST       | Language-specific        | Source code
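Code-aware chunking deserves a quick illustration. For Python sources, the standard-library ast module is enough to split at top-level function and class boundaries (a simplified sketch; a production chunker would also handle imports, module docstrings, and nested definitions):

import ast

def chunk_python_source(source: str) -> list[str]:
    """One chunk per top-level function or class, so code and its body stay intact."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno / end_lineno are 1-based and inclusive (Python 3.8+)
            chunks.append("\n".join(lines[node.lineno - 1:node.end_lineno]))
    return chunks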

4. Hybrid Retrieval: Vectors + Keywords

Pure semantic search fails when users use exact technical terms, error codes, or identifiers. Pure keyword search fails when users describe concepts in different words than the documents use. Hybrid retrieval combines both.

class HybridRetriever:
    """Combine semantic and keyword retrieval with reciprocal rank fusion"""

    def __init__(self, vector_store, keyword_index, alpha=0.7):
        self.vector_store = vector_store
        self.keyword_index = keyword_index
        self.alpha = alpha  # weight for semantic vs keyword

    async def retrieve(self, query: str, top_k: int = 10) -> List[RetrievedChunk]:
        # Parallel retrieval from both sources
        semantic_results, keyword_results = await asyncio.gather(
            self.vector_store.search(query, top_k=top_k * 2),
            self.keyword_index.search(query, top_k=top_k * 2)
        )

        # Reciprocal Rank Fusion (RRF)
        fused_scores = {}
        k = 60  # RRF constant

        for rank, result in enumerate(semantic_results):
            fused_scores[result.id] = fused_scores.get(result.id, 0) + \
                self.alpha * (1 / (k + rank + 1))

        for rank, result in enumerate(keyword_results):
            fused_scores[result.id] = fused_scores.get(result.id, 0) + \
                (1 - self.alpha) * (1 / (k + rank + 1))

        # Sort by fused score and return top_k
        ranked = sorted(fused_scores.items(), key=lambda x: x[1], reverse=True)
        return [self.get_chunk(chunk_id) for chunk_id, _ in ranked[:top_k]]

The alpha parameter controls the balance. We've found 0.7 (70% semantic, 30% keyword) works well for technical documentation. For code search, we shift to 0.4 (more keyword-heavy, since exact function names and error codes matter more than semantic meaning).

5. Memory Spine as a RAG Backend

Memory Spine wasn't built as a RAG system — it was built as a persistent memory layer for AI agents. But it turns out that the requirements for production RAG and production agent memory overlap almost completely: store structured knowledge, retrieve relevant context fast, maintain freshness, and support multiple retrieval strategies.

Why Memory Spine Works for RAG

# Using Memory Spine as a RAG backend
from memory_spine import MemorySpineClient

async def rag_with_memory_spine(query: str) -> str:
    client = MemorySpineClient("http://localhost:8788")

    # Hybrid retrieval: semantic + tag-based filtering
    context = await client.llm_context_window(
        query=query,
        max_tokens=4000
    )

    # Context is already formatted for LLM consumption
    # Includes relevance scores, source metadata, and temporal ordering

    response = await llm.generate(
        system="Answer based on the provided context. Cite sources.",
        context=context.formatted_text,
        query=query
    )

    # Store the interaction for future retrieval improvement
    await client.store(
        content=f"Q: {query}\nA: {response}",
        tags=["rag-interaction", "qa-pair"],
        metadata={"query": query, "sources": context.source_ids}
    )

    return response

Ingestion Pipeline

async def ingest_documents(docs: List[Document]):
    """Ingest documents into Memory Spine for RAG"""
    client = MemorySpineClient("http://localhost:8788")

    for doc in docs:
        # Semantic chunking
        chunks = semantic_chunker.chunk(doc.text)

        for i, chunk in enumerate(chunks):
            await client.store(
                content=chunk.text,
                tags=["rag-source", f"doc:{doc.id}", f"section:{chunk.section}"],
                metadata={
                    "document_id": doc.id,
                    "chunk_index": i,
                    "source_url": doc.url,
                    "last_updated": doc.modified_at.isoformat()
                }
            )

    # Trigger consolidation to merge near-duplicate chunks
    await client.consolidate(decay_threshold=0.3)

6. Evaluation Metrics for Production RAG

You can't improve what you don't measure. These are the metrics that actually matter for production RAG systems.

Retrieval Metrics

Context relevance (how much of the retrieved context actually addresses the query), recall (did the relevant chunks make it into the top-k results at all), and MRR (how high the first relevant chunk ranks). These are the same retrieval metrics worth alerting on in production.

Generation Metrics

Faithfulness (is every claim in the answer supported by the retrieved context), answer relevance (does the answer actually address the question), and groundedness (the fraction of extracted claims that can be verified against the sources).
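Recall and MRR can be computed offline against a labeled query set. A minimal sketch (the ground-truth format here is an assumption, not part of any API used elsewhere in this guide):

def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    """Fraction of the relevant chunks that appear in the top-k results."""
    if not relevant_ids:
        return 1.0
    hits = sum(1 for chunk_id in retrieved_ids[:k] if chunk_id in relevant_ids)
    return hits / len(relevant_ids)

def reciprocal_rank(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Reciprocal rank of the first relevant chunk; average over a query set to get MRR."""
    for rank, chunk_id in enumerate(retrieved_ids, start=1):
        if chunk_id in relevant_ids:
            return 1.0 / rank
    return 0.0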

class RAGEvaluator:
    """Automated evaluation pipeline for production RAG"""

    async def evaluate(self, query: str, answer: str, contexts: List[str]) -> RAGMetrics:
        # Retrieval quality
        context_relevance = await self.judge_relevance(query, contexts)

        # Generation quality
        faithfulness = await self.judge_faithfulness(answer, contexts)
        answer_relevance = await self.judge_answer_relevance(query, answer)

        # Groundedness check
        claims = await self.extract_claims(answer)
        grounded_claims = await self.verify_grounding(claims, contexts)
        groundedness = len(grounded_claims) / len(claims) if claims else 1.0

        return RAGMetrics(
            context_relevance=context_relevance,
            faithfulness=faithfulness,
            answer_relevance=answer_relevance,
            groundedness=groundedness
        )

    async def judge_faithfulness(self, answer: str, contexts: List[str]) -> float:
        """Use a judge LLM to score faithfulness"""
        prompt = f"""Score how faithful this answer is to the provided context.
        Score 1.0 if every claim is supported. Score 0.0 if it's entirely fabricated.

        Context: {' '.join(contexts)}
        Answer: {answer}
        Score (0.0-1.0):"""

        score = await self.judge_llm.generate(prompt)
        return float(score.strip())

Metrics That Moved the Needle

After implementing automated evaluation, we discovered that 23% of our RAG responses had faithfulness scores below 0.7 — meaning nearly a quarter of answers contained unsupported claims. Fixing the chunking strategy alone raised average faithfulness to 0.91.

7. Operational Lessons

After running RAG in production for over a year, these are the lessons that no tutorial teaches:

1. Index freshness is a feature. Stale indexes produce stale answers. Implement automated re-ingestion pipelines that detect source document changes and update chunks incrementally — not by rebuilding the entire index.

2. Chunk metadata is as important as chunk content. Store the source URL, section heading, document title, and last-modified date with every chunk. When the LLM cites a source, users should be able to verify it.

3. Monitor retrieval, not just generation. If your retrieval quality degrades, generation quality follows. Set up alerts on retrieval metrics (MRR, recall) and catch problems before users notice.

4. Query analytics reveal gaps. Log every query. Cluster them. Identify topics where retrieval consistently scores low. Those clusters tell you exactly which documents to add or improve.

5. Fallback gracefully. When retrieval returns nothing relevant (confidence below threshold), say "I don't have enough information to answer this" instead of hallucinating. Users trust honest uncertainty more than confident fabrication.
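
A minimal version of that fallback, assuming each retrieved chunk carries a relevance score (the 0.5 threshold is illustrative and should be tuned against your own evaluation data):

async def answer_with_fallback(query: str, retriever, llm, min_score: float = 0.5) -> str:
    chunks = await retriever.retrieve(query, top_k=5)
    confident = [c for c in chunks if c.score >= min_score]
    if not confident:
        # Honest uncertainty beats confident fabrication.
        return "I don't have enough information to answer this."
    context = "\n\n".join(c.text for c in confident)
    return await llm.generate(
        f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
    )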

The difference between a RAG demo and a production RAG system isn't the retrieval algorithm — it's the evaluation pipeline, the monitoring, and the feedback loop that continuously improves retrieval quality.

Build Production RAG with Memory Spine

Memory Spine provides hybrid retrieval, automatic consolidation, and temporal awareness out of the box. Skip the infrastructure work and focus on your retrieval quality.

Start Building →

🔧 Related ChaozCode Tools

Memory Spine

Production-grade RAG backend with hybrid retrieval, consolidation, and temporal awareness

Zearch

Deep research engine with multi-source synthesis and intelligent search ranking

HelixHyper

Knowledge graph for relationship-aware retrieval beyond flat vector search

Explore all 8 ChaozCode apps >