Retrieval-Augmented Generation (RAG)
Knowledge Strategy
An architecture that supplements an LLM's parametric knowledge by retrieving relevant documents from an external corpus at inference time, grounding generation in citable, updatable sources rather than relying solely on memorized training data.
RAG pipelines have three stages:
(1) Indexing — documents are chunked (typically 256-1024 tokens), embedded with a vector model (e.g. text-embedding-3-small, BGE, E5), and stored in a vector database (Pinecone, Weaviate, pgvector, Qdrant).
(2) Retrieval — at query time, the user's question is embedded and used for approximate nearest-neighbor (ANN) search against the index, returning the top-k most relevant chunks.
(3) Generation — the retrieved chunks are placed in the LLM's context alongside the query, and the model generates a response grounded in them.
Frameworks: LangChain, LlamaIndex, Haystack. Key tuning knobs: chunk size, chunk overlap, embedding model, retrieval top-k, and reranking strategy.
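The sketch below walks through the three stages end to end. It is illustrative rather than tied to any framework: the hashed bag-of-words embedding stands in for a real embedding model, the brute-force cosine index stands in for a vector database, and all names here (chunk, embed_texts, VectorIndex, build_prompt) are assumptions made for the example.

```python
# Minimal end-to-end RAG sketch: chunk -> embed -> index -> retrieve -> prompt.
# The embedding and index are toy stand-ins for a real embedding model and
# vector database; swap them out for production use.
from typing import List, Tuple
import numpy as np

CHUNK_TOKENS = 512    # target chunk size in tokens (256-1024 is typical)
OVERLAP_TOKENS = 64   # overlap between adjacent chunks
TOP_K = 4             # number of chunks handed to the generator


def chunk(text: str, size: int = CHUNK_TOKENS, overlap: int = OVERLAP_TOKENS) -> List[str]:
    """Split text into overlapping chunks; whitespace tokens stand in for real tokenization."""
    tokens = text.split()
    step = size - overlap
    return [" ".join(tokens[i:i + size]) for i in range(0, max(len(tokens) - overlap, 1), step)]


def embed_texts(texts: List[str], dim: int = 256) -> np.ndarray:
    """Toy hashed bag-of-words vectors; a real system calls an embedding model
    such as text-embedding-3-small, BGE, or E5 here."""
    out = np.zeros((len(texts), dim))
    for row, text in enumerate(texts):
        for token in text.lower().split():
            out[row, hash(token) % dim] += 1.0
    return out


class VectorIndex:
    """Brute-force cosine-similarity index; a vector database (Pinecone, Weaviate,
    pgvector, Qdrant) plays this role in production, usually with ANN search."""

    def __init__(self) -> None:
        self.vectors: List[np.ndarray] = []
        self.chunks: List[str] = []

    def add(self, chunks: List[str]) -> None:
        for vec, text in zip(embed_texts(chunks), chunks):
            self.vectors.append(vec / (np.linalg.norm(vec) or 1.0))
            self.chunks.append(text)

    def search(self, query: str, k: int = TOP_K) -> List[Tuple[float, str]]:
        q = embed_texts([query])[0]
        q = q / (np.linalg.norm(q) or 1.0)
        scores = np.stack(self.vectors) @ q
        top = np.argsort(scores)[::-1][:k]
        return [(float(scores[i]), self.chunks[i]) for i in top]


def build_prompt(query: str, hits: List[Tuple[float, str]]) -> str:
    """Generation stage: retrieved chunks are placed in the prompt ahead of the question."""
    context = "\n\n".join(f"[{i + 1}] {text}" for i, (_, text) in enumerate(hits))
    return f"Answer using only the numbered sources below, citing them by number.\n\n{context}\n\nQuestion: {query}"
```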
Why Does This Exist?
RAG creates a structural separation between "knowledge I retrieved from a source" and "knowledge I'm generating from parameters," making it one of the most effective techniques for grounding claims in citable, verifiable evidence. When a RAG system attributes a claim to a retrieved passage, it provides the epistemic provenance that purely parametric generation cannot.
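One way to make that separation concrete, sketched with hypothetical names (GroundedAnswer, verify_citations): the generated answer carries the IDs of the passages it cites, and those IDs are checked against what the retriever actually returned, so a grounded claim is distinguishable from a purely parametric one.

```python
# Keeping provenance explicit (illustrative names): an answer cites retrieved
# passage IDs, and a citation is only accepted if the retriever actually
# returned that passage for this request.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class GroundedAnswer:
    text: str                   # the generated answer
    cited_chunk_ids: List[str]  # IDs of retrieved passages the answer relies on


def verify_citations(answer: GroundedAnswer, retrieved: Dict[str, str]) -> List[str]:
    """Return any cited IDs that do NOT correspond to a retrieved passage."""
    return [cid for cid in answer.cited_chunk_ids if cid not in retrieved]
```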
The external retrieval corpus functions as a persistent, updatable knowledge store that survives beyond the context window and across sessions. Unlike parametric memory (frozen at training time), a RAG index can be updated in real time — add a document and the model immediately has access to it. This is one of the simplest production paths to AI systems with evolving, long-term knowledge.
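Continuing the VectorIndex sketch above (the documents and query are invented for illustration), a newly inserted document becomes retrievable on the very next query, with no retraining step involved:

```python
# Incremental update: index new content and it is immediately retrievable.
index = VectorIndex()
index.add(chunk("Employees accrue 20 vacation days per year, reviewed annually."))

# New knowledge arrives after the index was built; insert it at runtime.
index.add(chunk("Policy update: remote work now requires written manager approval."))

hits = index.search("What is the remote work policy?")
prompt = build_prompt("What is the remote work policy?", hits)
```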
Updating a retrieval index is orders of magnitude cheaper than retraining or fine-tuning a model. When knowledge changes (new product docs, updated regulations, recent events), RAG allows instant updates at the cost of index insertion rather than GPU-hours of training. This decouples knowledge freshness from training cost.
RAG extends a model's effective knowledge far beyond its training data cutoff and memorization capacity. A 7B model with access to a curated retrieval corpus can outperform a 70B model on domain-specific factual questions — capability via access rather than via parameters.