Understanding RAG
RAG stands for Retrieval-Augmented Generation — a technique that combines two capabilities: retrieving relevant information from an external knowledge source, and using a language model to generate a coherent, accurate answer based on that retrieved content.
In simple terms: instead of relying solely on what an AI model memorized during training, RAG lets the model look something up first — then answer.
Concepts
Retrieval-Augmented Generation (RAG) — a pattern that extends an LLM with a live knowledge retrieval step. The model receives both the user's question and the retrieved context when generating a response.
Embedding — the process of converting a piece of text into a fixed-size numerical vector that captures its semantic meaning. Similar texts produce vectors that are close together in vector space.
Vector Store — a database optimized for storing and searching high-dimensional vectors. It enables fast similarity lookups across millions of entries.
Cosine Similarity — a metric used to measure how similar two vectors are, regardless of their magnitude. A score of 1 means identical direction; 0 means unrelated.
Chunking — splitting a long document into smaller, self-contained pieces so that each chunk can be embedded and retrieved independently.
Top-K Retrieval — fetching the K most semantically similar chunks from the vector store for a given query.
Prompt Augmentation — injecting retrieved chunks into the LLM's prompt as context, so the model generates answers grounded in real data rather than training memory.
Hallucination — when an LLM generates a confident but factually incorrect response, typically because it lacks grounding information.
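Cosine similarity, as defined above, is simple enough to sketch in plain Python with no external libraries:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """dot(a, b) / (|a| * |b|): 1 = same direction, 0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Vectors pointing the same way score ~1 regardless of magnitude:
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # ≈ 1.0
```

Because the metric ignores magnitude, a short chunk and a long chunk about the same topic can still score as near-identical in direction.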
What Problem Does RAG Solve?
Large language models are trained on massive datasets up to a fixed point in time. This creates several real problems in production:
- Knowledge cutoff — the model has no awareness of events or changes after its training date.
- Hallucination — when the model doesn't know an answer, it may fabricate a plausible-sounding but incorrect one.
- No private data access — the model was never trained on your company's internal documents, product manuals, or proprietary databases.
- Retraining cost — keeping a model's knowledge current by retraining it is expensive, slow, and impractical for frequently changing data.
RAG addresses all of these by keeping the knowledge external and queryable, fetching only what's needed at the moment a question is asked.
How RAG Works — Step by Step
RAG has three runtime phases that map directly to its name: Retrieval, Augmentation, Generation — plus an upfront indexing phase that prepares the knowledge base.
flowchart TD
A([User Question]) --> B[Embed the Question\ninto a Vector]
B --> C[(Vector Store\nKnowledge Base)]
C --> D[Retrieve Top-K\nRelevant Chunks]
D --> E[Augment the Prompt\nwith Retrieved Chunks]
E --> F[LLM Generates\nthe Answer]
F --> G([Final Answer\nto User])
Phase 0 — Indexing
Before RAG can retrieve anything, the knowledge base must be prepared. This step runs once (or whenever the data changes):
- Collect documents — PDFs, manuals, articles, database records, etc.
- Split into chunks — break each document into smaller pieces (e.g., paragraphs or 300-token windows).
- Embed each chunk — convert each chunk into a numerical vector using an embedding model (e.g., text-embedding-3-small from OpenAI).
- Store in a vector database — save each vector alongside its source text (e.g., in Pinecone, Chroma, Weaviate, or pgvector).
flowchart LR
A[Raw Documents] --> B[Chunking]
B --> C[Embedding Model]
C --> D[(Vector Database)]
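The indexing pipeline above can be sketched with an in-memory list standing in for the vector database. Note that `embed` here is a deliberate fake for illustration; a real system would call an embedding model such as text-embedding-3-small:

```python
def embed(text: str) -> list[float]:
    # Stand-in for a real embedding model call; real embeddings
    # capture semantics, this toy version does not.
    return [float(ord(c)) for c in text[:16].ljust(16)]

def chunk(document: str, size: int = 300) -> list[str]:
    # Naive fixed-size windows; production chunkers usually split on
    # paragraph or sentence boundaries to keep chunks self-contained.
    return [document[i:i + size] for i in range(0, len(document), size)]

# The "vector database": a list of (vector, source_text) pairs.
vector_store: list[tuple[list[float], str]] = []

def index_document(document: str) -> None:
    for piece in chunk(document):
        vector_store.append((embed(piece), piece))

index_document("Items damaged during shipping can be returned within 30 days.")
print(len(vector_store))  # 1 — the document fits in a single chunk
```

The key design point is that each vector is stored *alongside* its source text, so a similarity hit can be mapped straight back to a human-readable chunk.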
Phase R — Retrieval
When a user submits a question:
- The question is passed through the same embedding model to produce a query vector.
- That query vector is compared against all stored vectors using cosine similarity.
- The top-K most similar chunks are retrieved and passed along.
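The three steps above reduce to scoring every stored vector against the query vector and keeping the best K. A brute-force sketch, assuming the store is a list of (vector, text) pairs:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def retrieve_top_k(query_vec: list[float],
                   store: list[tuple[list[float], str]],
                   k: int = 3) -> list[str]:
    # Score every stored chunk against the query, highest first.
    scored = sorted(store,
                    key=lambda entry: cosine_similarity(query_vec, entry[0]),
                    reverse=True)
    return [text for _, text in scored[:k]]

store = [
    ([0.12, -0.87, 0.45], "Items damaged during shipping can be returned within 30 days."),
    ([0.90, 0.10, -0.30], "Standard shipping takes 3-5 business days."),
]
print(retrieve_top_k([0.11, -0.85, 0.49], store, k=1))
# -> the damaged-items chunk
```

A real vector store replaces this linear scan with an approximate nearest-neighbor index, which is what keeps lookups fast across millions of entries.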
Phase A — Augmentation
The retrieved chunks are injected into the prompt that gets sent to the LLM. A typical augmented prompt looks like this:
You are a helpful assistant. Use only the information below to answer.
Context:
[Chunk 1 text...]
[Chunk 2 text...]
[Chunk 3 text...]
Question: What is the return policy for damaged items?
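Assembling that augmented prompt is plain string formatting; a minimal sketch:

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    # Join retrieved chunks into a context block, then append the question.
    context = "\n".join(chunks)
    return (
        "You are a helpful assistant. Use only the information below to answer.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

prompt = build_prompt(
    "What is the return policy for damaged items?",
    ["Items damaged during shipping can be returned within 30 days."],
)
print(prompt)
```

The "use only the information below" instruction matters: it tells the model to prefer the injected context over whatever it memorized during training.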
Phase G — Generation
The LLM reads both the injected context and the user's question, then generates an answer grounded in the retrieved information — not drawn from training memory alone. This is what makes RAG responses both accurate and traceable.
End-to-End Example: Customer Support Bot
Let's trace through a real-world scenario. You're building an AI assistant for an e-commerce platform that answers questions about orders, shipping, and return policies using the company's internal documentation.
Step 1 — Indexing the Knowledge Base
The company has a policy document that contains:
"Items damaged during shipping can be returned within 30 days. The customer must provide a photo of the damage. Refunds are processed within 5–7 business days."
This paragraph gets chunked, embedded into a vector, and stored in the vector database.
flowchart LR
A["Policy PDF"] --> B["Chunk:\n'Items damaged during shipping\ncan be returned within 30 days...'"]
B --> C["Embedding Model\n[0.12, -0.87, 0.45, ...]"]
C --> D[(Vector DB)]
Step 2 — User Asks a Question
A user types:
"I received a broken item. Can I get a refund?"
The system passes this question through the embedding model and produces a query vector:
[0.11, -0.85, 0.49, ...]
Step 3 — Retrieval
The system searches the vector database. The chunk about damaged items has a high cosine similarity to the query vector and is retrieved as the top match.
flowchart TD
A["Query Vector\n[0.11, -0.85, 0.49]"] --> B[(Vector DB)]
B --> C["Similarity Search"]
C --> D["Top Match:\n'Items damaged during shipping\ncan be returned within 30 days...'"]
Step 4 — Augmentation
The system builds the following augmented prompt:
You are a customer support assistant. Use only the context below to answer.
Context:
Items damaged during shipping can be returned within 30 days.
The customer must provide a photo of the damage.
Refunds are processed within 5–7 business days.
Question: I received a broken item. Can I get a refund?
Step 5 — Generation
The LLM generates:
"Yes! Since your item was damaged during shipping, you can return it within 30 days. Please include a photo of the damage with your return request. Once received, your refund will be processed within 5–7 business days."
The answer is accurate, grounded in real company policy, and contains nothing the model invented from training data.
Key Components Summary
flowchart TB
subgraph Indexing
A[Documents] --> B[Chunker]
B --> C[Embedding Model]
C --> D[(Vector Store)]
end
subgraph Query Time
E[User Question] --> F[Embedding Model]
F --> G[Similarity Search]
D --> G
G --> H[Retrieved Chunks]
H --> I[Prompt Builder]
E --> I
I --> J[LLM]
J --> K[Answer]
end
| Component | Role |
|---|---|
| Chunker | Splits documents into manageable pieces |
| Embedding Model | Converts text to numerical vectors |
| Vector Store | Stores and searches vectors efficiently |
| Retriever | Finds the most relevant chunks for a query |
| LLM | Reads context + question, generates the answer |
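Wired together, the query-time path in the table above fits in a short function. In this sketch `embed` (a crude letter-frequency vector) and `generate` are stand-ins for a real embedding model and LLM; the retrieval and prompt-building logic is the genuine pattern:

```python
import math

def embed(text: str) -> list[float]:
    # Stub embedder: letter frequencies instead of a real semantic model.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(x * x for x in b)) or 1.0
    return dot / (na * nb)

def generate(prompt: str) -> str:
    # Stub: stands in for a real LLM call.
    return "(LLM answer grounded in)\n" + prompt

def answer(question: str, store, k: int = 2) -> str:
    q = embed(question)  # Embedding Model
    top = sorted(store, key=lambda e: cosine(q, e[0]), reverse=True)[:k]  # Retriever
    context = "\n".join(text for _, text in top)
    prompt = (f"Use only the context below.\n\nContext:\n{context}"
              f"\n\nQuestion: {question}")  # Prompt Builder
    return generate(prompt)  # LLM

docs = ["Items damaged during shipping can be returned within 30 days.",
        "Refunds are processed within 5-7 business days."]
store = [(embed(d), d) for d in docs]  # Chunker + Vector Store (pre-chunked)
print(answer("Can I return a damaged item?", store, k=1))
```

Even with this toy embedder, the damaged-items chunk outranks the refunds chunk for the sample question; swapping in a real embedding model and LLM client turns the same skeleton into a working pipeline.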
When to Use RAG
RAG is the right choice when:
- You need the AI to answer questions about private or proprietary data that was never part of training.
- Your knowledge base changes frequently — re-indexing documents is far cheaper than retraining.
- You need citations or traceability — you can show users exactly which chunk the answer came from.
- You want to reduce hallucinations by grounding every response in real, retrieved documents.
RAG vs. Fine-Tuning
Both RAG and fine-tuning can give an LLM access to domain-specific knowledge, but they do so in fundamentally different ways.
| | RAG | Fine-Tuning |
|---|---|---|
| Updates knowledge | Re-index documents | Retrain the model |
| Handles private data | Yes | Yes (but data is baked in) |
| Reduces hallucination | Yes (with grounding) | Partially |
| Cost | Low (retrieval is cheap) | High (training is expensive) |
| Transparency | High (can cite sources) | Low (knowledge is implicit) |
Fine-tuning bakes knowledge into the model's weights — useful for teaching it a specific tone or task style, but expensive and opaque. RAG keeps knowledge external and queryable — cheaper to update, easier to audit, and better at reducing hallucinations when the retrieval quality is high.
Summary
RAG is a practical, proven pattern that gives language models access to up-to-date, domain-specific knowledge without requiring any retraining. It works in four clear steps:
- Indexing — chunk and embed documents into a vector store.
- Retrieval — embed the user's query and fetch the most semantically similar chunks.
- Augmentation — inject those chunks into the LLM prompt as grounding context.
- Generation — the LLM produces an accurate, traceable answer from the provided context.
It directly solves the core limitations of static language models — knowledge cutoffs, hallucinations, and inability to access private data — which is why RAG has become one of the most widely deployed patterns in production AI systems today.
