Retrieval-Augmented Generation (RAG)

1. Overview & Problem Solved

Large Language Models (LLMs) are powerful, but they suffer from knowledge cutoffs and hallucinations. They don't inherently know your proprietary company data, and retraining or fine-tuning them on every data update is prohibitively expensive and slow.

Retrieval-Augmented Generation (RAG) solves this by grounding the LLM's generation in external, verifiable data. Instead of relying solely on parametric memory (what the model learned during training), RAG retrieves relevant documents from a non-parametric knowledge base (like a vector database) at inference time, injecting that context directly into the prompt.
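
To make "injecting that context directly into the prompt" concrete, here is a minimal sketch of the prompt-assembly step on its own. The function name `build_grounded_prompt`, the instruction wording, and the sample passages are illustrative assumptions, not part of any library; retrieval itself is assumed to have already happened.

```python
def build_grounded_prompt(question: str, contexts: list[str]) -> str:
    """Assemble a prompt that places retrieved passages ahead of the question.

    Hypothetical helper for illustration; the wording of the instructions
    is an assumption and should be tuned for a real application.
    """
    # Number each passage so the model (and a human reader) can tell them apart.
    numbered = "\n\n".join(
        f"[{i}] {text}" for i, text in enumerate(contexts, start=1)
    )
    return (
        "Answer using only the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


if __name__ == "__main__":
    # Made-up passages standing in for retrieved documents.
    passages = [
        "The Model Z warranty lasts 24 months.",
        "Returns are accepted within 30 days of delivery.",
    ]
    print(build_grounded_prompt("How long is the Model Z warranty?", passages))
```

The model never sees the knowledge base directly. It only sees whatever passages the retriever selected, which is why retrieval quality bounds answer quality.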

2. When to Use / When NOT to Use

When to Use RAG

Question Answering over Proprietary Data: Customer support bots, internal wiki search.
High Hallucination Risk: When factual accuracy is critical (legal, medical, financial domains).
Dynamic Data: Your knowledge base updates frequently (daily or hourly).

When NOT to Use RAG

Task-Specific Behavioral Changes: Use fine-tuning if you want the model to change its tone, format, or reasoning style.
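Broad General Knowledge Tasks: If the foundation model already knows the answer (e.g., "What is the capital of France?"), retrieval adds cost without improving the result.
Extremely Low Latency Requirements: The retrieval step adds latency to every request, because the query must be embedded and the vector store searched before generation can start.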

3. Architecture Diagram

Here is the standard RAG pipeline, covering both the indexing phase and the generation phase.

graph TD
    subgraph Indexing["Indexing Phase"]
        A[Raw Documents] --> B[Chunking/Splitting]
        B --> C[Embedding Model]
        C --> D[(Vector Database)]
    end

    subgraph Serving["Retrieval & Generation Phase"]
        E[User Query] --> F[Embedding Model]
        F --> G[Similarity Search]
        D -.-> G
        G --> H[Retrieve Top-K Contexts]
        H --> I[Prompt Synthesis]
        E --> I
        I --> J[LLM]
        J --> K[Final Output]
    end
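
The two phases in the diagram can be sketched in plain Python without any external service. This is a toy illustration only: `embed` is a word-count stand-in for a real embedding model, `ToyVectorStore` is a hypothetical in-memory list rather than a vector database, and the sample chunks are invented.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """Hypothetical in-memory store holding (vector, text) pairs."""

    def __init__(self) -> None:
        self._rows: list[tuple[Counter, str]] = []

    def index(self, chunks: list[str]) -> None:
        # Indexing phase: embed each chunk once and keep vector and text together.
        for chunk in chunks:
            self._rows.append((embed(chunk), chunk))

    def top_k(self, query: str, k: int = 2) -> list[str]:
        # Retrieval phase: embed the query with the same function,
        # then rank stored chunks by similarity and keep the best k.
        q = embed(query)
        ranked = sorted(self._rows, key=lambda row: cosine(q, row[0]), reverse=True)
        return [text for _, text in ranked[:k]]


if __name__ == "__main__":
    store = ToyVectorStore()
    store.index([
        "Invoices are emailed on the first day of each month.",
        "Password resets require a verified phone number.",
        "The office is closed on public holidays.",
    ])
    print(store.top_k("How do I reset my password?", k=1))
```

The key design point carried over from the diagram is that documents and queries pass through the same embedding function, so their vectors are comparable. A real embedding model would also match "reset" with "resets", which this word-count toy does not.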

4. Step-by-Step Tutorial & Installation

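Let's build a minimal RAG pipeline using LangChain and OpenAI. The Python example stores vectors in a local ChromaDB; the TypeScript example uses LangChain.js's in-memory vector store.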

Prerequisites

Install the required dependencies:

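Python:
pip install langchain langchain-community langchain-openai langchain-text-splitters chromadb

TypeScript:
npm install langchain @langchain/core @langchain/openai @langchain/community chromadb

Both examples expect an OpenAI API key in the OPENAI_API_KEY environment variable.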

5. Code Examples

Python (LangChain)

import os
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

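# OpenAIEmbeddings and ChatOpenAI read the API key from OPENAI_API_KEY; fail early if it is missing.
if not os.environ.get("OPENAI_API_KEY"):
    raise RuntimeError("Set the OPENAI_API_KEY environment variable before running this script.")

# 1. Load and Chunk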
loader = TextLoader("knowledge_base.txt")
docs = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)

# 2. Embed and Store
vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

# 3. Setup LLM and Chain
llm = ChatOpenAI(model_name="gpt-4o-mini", temperature=0)
template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}
Helpful Answer:"""
custom_rag_prompt = PromptTemplate.from_template(template)

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | custom_rag_prompt
    | llm
    | StrOutputParser()
)

# 4. Invoke
response = rag_chain.invoke("What are the key benefits of this product?")
print(response)
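
To show what the `chunk_size` and `chunk_overlap` parameters in step 1 control, here is a simplified fixed-window splitter. It is an illustrative sketch only: `split_with_overlap` is a hypothetical helper, and it does not reproduce the separator-aware logic of `RecursiveCharacterTextSplitter`.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows that share chunk_overlap characters.

    Simplified illustration; a real splitter also tries to break on
    paragraph, sentence, or word boundaries.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")

    # Each new window starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    print(split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2))
    # prints ['abcd', 'cdef', 'efgh', 'ghij']
```

With `chunk_size=1000` and `chunk_overlap=200`, each window starts 800 characters after the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk as long as it is shorter than the overlap.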

TypeScript (LangChain.js)

import { ChatOpenAI, OpenAIEmbeddings } from "@langchain/openai";
import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
import { createStuffDocumentsChain } from "langchain/chains/combine_documents";
import { createRetrievalChain } from "langchain/chains/retrieval";
import { PromptTemplate } from "@langchain/core/prompts";

async function runRAG() {
  // 1. Data Prep
  const text = "Company X's Q3 revenue grew by 15% to $50M...";
  const splitter = new RecursiveCharacterTextSplitter({ chunkSize: 100, chunkOverlap: 20 });
  const docs = await splitter.createDocuments([text]);

  // 2. Vector Store
  const vectorStore = await MemoryVectorStore.fromDocuments(docs, new OpenAIEmbeddings());
  const retriever = vectorStore.asRetriever();

  // 3. Chain
  const llm = new ChatOpenAI({ modelName: "gpt-4o-mini", temperature: 0 });
  const prompt = PromptTemplate.fromTemplate(`
    Answer the following question based only on the provided context:
    <context>{context}</context>
    Question: {input}
  `);
  
  const combineDocsChain = await createStuffDocumentsChain({ llm, prompt });
  const retrievalChain = await createRetrievalChain({ combineDocsChain, retriever });

  // 4. Invoke
  const result = await retrievalChain.invoke({ input: "What was the Q3 revenue?" });
  console.log(result.answer);
}

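// Surface API or network errors instead of leaving an unhandled promise rejection.
runRAG().catch(console.error);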

6. Security, Performance & Best Practices

Chunking Strategy: Don't split on a fixed character count alone. Prefer semantic or structure-aware splitting (e.g., Markdown splitters) so that chunks align with headings, sections, and paragraphs.
Hybrid Search: Combine Dense (Vector) search with Sparse (BM25/Keyword) search to handle both semantic queries and exact keyword matches (like part numbers).
Re-ranking: Retrieve a broader set of documents (e.g., top 20), then use a dedicated reranking model (a cross-encoder, or a hosted reranker such as Cohere Rerank) to re-sort them, keeping only the top 3-5 for the LLM. This typically improves precision, at the cost of an extra model call per query.
Security (Prompt Injection): Treat both user input and retrieved documents as untrusted text, and keep them clearly delimited from your instructions in the prompt. Ensure your vector store enforces data segregation (e.g., multi-tenancy metadata filters) so users can't retrieve data they lack permission to see.
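
As one way to combine dense and sparse results for hybrid search, the sketch below uses Reciprocal Rank Fusion (RRF). RRF is a common fusion method chosen here for illustration, not something this guide prescribes; the document IDs are invented, and `k=60` is a conventional default rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) for every list it appears in,
    so items ranked well by multiple retrievers rise to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    # Hypothetical results from a vector search and a BM25 keyword search.
    dense_hits = ["doc_a", "doc_b", "doc_c"]
    sparse_hits = ["doc_c", "doc_a", "doc_d"]
    print(reciprocal_rank_fusion([dense_hits, sparse_hits]))
    # prints ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Because RRF uses only rank positions, it sidesteps the problem that cosine similarities and BM25 scores live on different scales. The fused list can then be passed to a reranker as described above.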

7. Related Projects & Alternatives

GraphRAG: Uses Knowledge Graphs to map entities and relationships, better for answering global "connect-the-dots" questions over a dataset.
Self-RAG / Corrective RAG (CRAG): Advanced architectures where the LLM evaluates the quality of retrieved documents and can decide to rewrite the query or search the web if local context is insufficient.
Fine-Tuning: The most common alternative, though the two are often complementary rather than competing. A typical pattern is to use RAG for facts and fine-tuning for format and behavior.
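
The control flow behind the "evaluate, then rewrite or fall back" idea can be sketched as a small loop. This is only loosely modeled on the description above and is not the published Self-RAG or CRAG algorithm; `corrective_retrieve` and all four callables are hypothetical placeholders that a real system would back with a retriever, an LLM-based grader, an LLM query rewriter, and a web search.

```python
from typing import Callable


def corrective_retrieve(
    query: str,
    retrieve: Callable[[str], list[str]],
    is_relevant: Callable[[str, str], bool],
    rewrite: Callable[[str], str],
    fallback: Callable[[str], list[str]],
    max_rewrites: int = 1,
) -> list[str]:
    """Retrieve, grade the results, and retry or fall back if nothing is usable."""
    current = query
    for attempt in range(max_rewrites + 1):
        # Keep only documents the grader judges relevant to the ORIGINAL question.
        kept = [doc for doc in retrieve(current) if is_relevant(query, doc)]
        if kept:
            return kept
        # Nothing usable: try a reformulated query if attempts remain.
        if attempt < max_rewrites:
            current = rewrite(current)
    # Local knowledge base was insufficient; use the fallback source.
    return fallback(query)


if __name__ == "__main__":
    # Stub components for demonstration only.
    def local_search(q: str) -> list[str]:
        return ["Refunds take 5 business days."] if "refund" in q else ["Unrelated text."]

    docs = corrective_retrieve(
        "money back timing",
        retrieve=local_search,
        is_relevant=lambda q, d: "Refund" in d,
        rewrite=lambda q: "refund processing time",
        fallback=lambda q: ["(web search result)"],
    )
    print(docs)
```

Grading against the original question rather than the rewritten one is a deliberate choice in this sketch, so that a poor rewrite cannot cause off-topic documents to be accepted.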

Written by: Senior AI Engineer, Technical Writer & Architect
Reviewed by: TechIdea Editorial Panel
