Retrieval-Augmented Generation (RAG)
1. Overview & Problem Solved
Large Language Models (LLMs) are powerful, but they suffer from knowledge cutoffs and hallucinations. They don't inherently know your proprietary company data, and retraining or fine-tuning them on every data update is prohibitively expensive and slow.
Retrieval-Augmented Generation (RAG) solves this by grounding the LLM's generation in external, verifiable data. Instead of relying solely on parametric memory (what the model learned during training), RAG retrieves relevant documents from a non-parametric knowledge base (like a vector database) at inference time, injecting that context directly into the prompt.
2. When to Use / When NOT to Use
When to Use RAG
- Question Answering over Proprietary Data: Customer support bots, internal wiki search.
- High Hallucination Risk: When factual accuracy is critical (legal, medical, financial domains).
- Dynamic Data: Your knowledge base updates frequently (daily or hourly).
When NOT to Use RAG
- Task-Specific Behavioral Changes: Use fine-tuning if you want the model to change its tone, format, or reasoning style.
- Broad General Knowledge Tasks: If the foundational model already knows the answer (e.g., "What is the capital of France?").
- Extreme Low Latency Requirements: The retrieval step adds network overhead.
3. Architecture Diagram
Here is the standard RAG pipeline, covering both the indexing phase and the generation phase.
graph TD
subgraph Indexing Phase
A[Raw Documents] --> B[Chunking/Splitting]
B --> C[Embedding Model]
C --> D[(Vector Database)]
end
subgraph Retrieval & Generation Phase
E[User Query] --> F[Embedding Model]
F --> G[Similarity Search]
D -.-> G
G --> H[Retrieve Top-K Contexts]
H --> I[Prompt Synthesis]
E --> I
I --> J[LLM]
J --> K[Final Output]
end4. Step-by-Step Tutorial & Installation
Let's build a minimal RAG pipeline using LangChain, OpenAI, and a local ChromaDB.
Prerequisites
Install the required dependencies:
npm install @langchain/openai @langchain/community chromadb pip install langchain langchain-openai chromadb5. Code Examples
Python (LangChain)
import os
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
# 1. Load and Chunk
loader = TextLoader("knowledge_base.txt")
docs = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)
# 2. Embed and Store
vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
# 3. Setup LLM and Chain
llm = ChatOpenAI(model_name="gpt-4o-mini", temperature=0)
template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
{context}
Question: {question}
Helpful Answer:"""
custom_rag_prompt = PromptTemplate.from_template(template)
def format_docs(docs):
return "\n\n".join(doc.page_content for doc in docs)
rag_chain = (
{"context": retriever | format_docs, "question": RunnablePassthrough()}
| custom_rag_prompt
| llm
| StrOutputParser()
)
# 4. Invoke
response = rag_chain.invoke("What are the key benefits of this product?")
print(response)TypeScript (LangChain.js)
import { ChatOpenAI, OpenAIEmbeddings } from "@langchain/openai";
import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
import { createStuffDocumentsChain } from "langchain/chains/combine_documents";
import { createRetrievalChain } from "langchain/chains/retrieval";
import { PromptTemplate } from "@langchain/core/prompts";
async function runRAG() {
// 1. Data Prep
const text = "Company X's Q3 revenue grew by 15% to $50M...";
const splitter = new RecursiveCharacterTextSplitter({ chunkSize: 100, chunkOverlap: 20 });
const docs = await splitter.createDocuments([text]);
// 2. Vector Store
const vectorStore = await MemoryVectorStore.fromDocuments(docs, new OpenAIEmbeddings());
const retriever = vectorStore.asRetriever();
// 3. Chain
const llm = new ChatOpenAI({ modelName: "gpt-4o-mini", temperature: 0 });
const prompt = PromptTemplate.fromTemplate(`
Answer the following question based only on the provided context:
<context>{context}</context>
Question: {input}
`);
const combineDocsChain = await createStuffDocumentsChain({ llm, prompt });
const retrievalChain = await createRetrievalChain({ combineDocsChain, retriever });
// 4. Invoke
const result = await retrievalChain.invoke({ input: "What was the Q3 revenue?" });
console.log(result.answer);
}
runRAG();6. Security, Performance & Best Practices
- Chunking Strategy: Don't just split by characters blindly. Use semantic splitting or structural awareness (e.g., Markdown splitters).
- Hybrid Search: Combine Dense (Vector) search with Sparse (BM25/Keyword) search to handle both semantic queries and exact keyword matches (like part numbers).
- Re-ranking: Retrieve a broader set of documents (e.g., top 20), then use a dedicated cross-encoder (like Cohere Rerank) to resort them, keeping only the top 3-5 for the LLM. This drastically improves precision.
- Security (Prompt Injection): Isolate user input from the retrieved context. Ensure your vector store handles data segregation (e.g., multi-tenancy metadata filters) so users can't query data they lack permissions to see.
7. Related Projects & Alternatives
- GraphRAG: Uses Knowledge Graphs to map entities and relationships, better for answering global "connect-the-dots" questions over a dataset.
- Self-RAG / Corrective RAG (CRAG): Advanced architectures where the LLM evaluates the quality of retrieved documents and can decide to rewrite the query or search the web if local context is insufficient.
- Fine-Tuning: The primary alternative. Generally, combine RAG (for facts) and Fine-Tuning (for format/behavior) for optimal results.
Editorial Integrity
Fact CheckedWritten By
Senior AI Engineer
Technical Writer & Architect
Reviewed By
TechIdea Editorial Panel
Technical accuracy verified by our expert engineering panel.
Why Trust TechIdea?