Building Your First RAG System
Build your first retrieval-augmented generation (RAG) system from scratch with LangChain, using a vector store such as ChromaDB, Pinecone, or FAISS.
What You'll Learn
- How the pieces of a LangChain RAG pipeline fit together
- Document chunking strategies that improve retrieval quality
- Building retrieval chains backed by ChromaDB, Pinecone, or Weaviate
- Optimizing context windows for better RAG performance
How RAG Works - Architecture Overview
What is RAG?
Retrieval-Augmented Generation (RAG) pairs a language model with a retrieval step over your own documents. Using LangChain together with a vector database such as ChromaDB, Pinecone, Weaviate, Qdrant, or Milvus, the system retrieves the most relevant passages for each query and uses them to generate accurate, grounded responses.
RAG Pipeline Components
Step 1: Document Ingestion
Load and preprocess documents for indexing
Step 2: Text Chunking
Split documents into pieces sized for effective retrieval
Step 3: Embeddings and Vector Storage
Embed each chunk and store it in ChromaDB, Pinecone, Weaviate, Qdrant, or Milvus
Step 4: Retrieval
Find the chunks most relevant to a query using vector search
Step 5: Response Generation
Generate grounded answers with a LangChain RAG chain
Building the RAG System
Loading and Chunking Documents
Understanding the RAGSystem Class:
- persist_directory: Where your vector database will be stored on disk
- GoogleGenerativeAIEmbeddings: Converts text into numerical vectors for semantic search
- vector_store: Database that stores document chunks and their embeddings
- retriever: Component that finds relevant chunks based on queries
Required Dependencies:
Before running this RAG implementation, install the required packages:
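The exact package set depends on your environment; based on the imports below, a typical install looks like this (python-dotenv is only needed for the complete example later on):

pip install langchain langchain-community langchain-google-genai langchain-chroma pypdf python-dotenv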
from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_chroma import Chroma # Can also use Pinecone, Weaviate, Qdrant, Milvus, or FAISS
import os
class RAGSystem:
def __init__(self, persist_directory="./rag_db"):
self.persist_directory = persist_directory
self.embeddings = GoogleGenerativeAIEmbeddings(
model="models/embedding-001"
)
self.vector_store = None
self.retriever = None
def load_documents(self, data_path):
"""Load documents from various sources"""
documents = []
# Load PDFs
pdf_loader = DirectoryLoader(
data_path,
glob="**/*.pdf",
loader_cls=PyPDFLoader,
show_progress=True
)
documents.extend(pdf_loader.load())
# Load text files
text_loader = DirectoryLoader(
data_path,
glob="**/*.txt",
loader_cls=TextLoader,
show_progress=True
)
documents.extend(text_loader.load())
print(f"Loaded {len(documents)} documents")
return documents
def chunk_documents(self, documents, chunk_size=1000, chunk_overlap=200):
"""Split documents into optimal chunks"""
# Create text splitter with semantic awareness
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=chunk_size,
chunk_overlap=chunk_overlap,
length_function=len,
separators=[
"\n\n", # Paragraph breaks
"\n", # Line breaks
". ", # Sentence ends
", ", # Clause breaks
" ", # Word breaks
"" # Character breaks
]
)
# Split documents
chunks = text_splitter.split_documents(documents)
# Add metadata to chunks
for i, chunk in enumerate(chunks):
chunk.metadata["chunk_id"] = i
chunk.metadata["chunk_size"] = len(chunk.page_content)
print(f"Created {len(chunks)} chunks")
return chunks
Document Loading Explained:
- DirectoryLoader: Recursively loads all files matching the glob pattern from a directory
- glob="**/*.pdf": Finds all PDF files in any subdirectory (** means any depth)
- PyPDFLoader: Extracts text from PDF files, preserving page information
- show_progress=True: Displays a progress bar during loading
Chunking Strategy Deep Dive:
- RecursiveCharacterTextSplitter: Tries to split at natural boundaries (paragraphs first, then sentences)
- chunk_size=1000: Each chunk contains ~1000 characters (a sensible default for most models)
- chunk_overlap=200: 200 characters repeated between chunks to preserve context
- Separator hierarchy: Splits at paragraphs first, only using smaller breaks if needed
- Metadata enrichment: Adds chunk_id and size for tracking and debugging
Performance Tips:
For large document collections (>1000 files), consider:
- Using async document loading for better performance
- Processing documents in batches to manage memory
- Adding file type filtering to avoid unsupported formats
Creating the Vector Store
Vector Store Concepts:
- Embeddings: Convert text chunks into high-dimensional vectors (typically 768-1536 dimensions)
- Vector databases: ChromaDB (open source), Pinecone (cloud), Weaviate (hybrid), Qdrant (performance), Milvus (scalable)
- HNSW (Hierarchical Navigable Small World): Algorithm for approximate nearest-neighbor search
- Cosine similarity: Measures the angle between vectors (a better fit for text than Euclidean distance); see the short sketch after this list
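To make the cosine-similarity idea concrete, here is a tiny standalone illustration (not part of the RAG class; it assumes NumPy is installed, and the toy vectors are made up):

import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional "embeddings" (real embeddings have hundreds of dimensions)
query_vec = np.array([0.2, 0.8, 0.1])
doc_vec = np.array([0.25, 0.75, 0.05])
print(cosine_similarity(query_vec, doc_vec))  # Close to 1.0, i.e. semantically similar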
def create_vector_store(self, chunks):
"""Create and persist vector store"""
# Create vector store from chunks
# Can also use Pinecone, Weaviate, Qdrant, Milvus, or FAISS
self.vector_store = Chroma.from_documents(
documents=chunks,
embedding=self.embeddings,
persist_directory=self.persist_directory,
collection_metadata={"hnsw:space": "cosine"}
)
# Persist to disk
# Note: Chroma 0.4.x+ automatically persists documents
print(f"Vector store created with {len(chunks)} chunks")
def load_vector_store(self):
"""Load existing vector store"""
self.vector_store = Chroma(
persist_directory=self.persist_directory,
embedding_function=self.embeddings
)
print("Vector store loaded from disk")
def setup_retriever(self, k=5, search_type="similarity"):
"""Configure retriever with advanced options"""
if not self.vector_store:
raise ValueError("Vector store not initialized")
        # Build search parameters that match the chosen search type
        search_kwargs = {"k": k}
        if search_type == "similarity_score_threshold":
            search_kwargs["score_threshold"] = 0.7  # Minimum similarity score
        elif search_type == "mmr":
            search_kwargs["fetch_k"] = k * 2  # Fetch extra candidates, then pick k diverse ones
        self.retriever = self.vector_store.as_retriever(
            search_type=search_type,
            search_kwargs=search_kwargs
        )
return self.retriever
Vector Store Creation Explained:
- from_documents(): Automatically generates embeddings for all chunks and stores them
- persist_directory: Saves the database to disk so you don't need to re-embed documents
- collection_metadata: Configures the search algorithm (HNSW with cosine similarity)
- Automatic persistence: Chroma 0.4.x+ writes the index to disk automatically; no explicit persist() call is needed
Retriever Configuration Deep Dive:
- k=5: Returns the top 5 most similar chunks (adjust based on your context window)
- search_type="similarity": Pure semantic search (alternatives: "mmr" for diversity, "similarity_score_threshold" to filter weak matches)
- score_threshold=0.7: With "similarity_score_threshold", only returns chunks with 70%+ similarity
- fetch_k=k*2: With "mmr", fetches 10 candidates and then selects the 5 most diverse
Common Pitfalls:
- Memory usage: Large collections can consume significant RAM during embedding
- Embedding costs: Some embedding models charge per token (Google's embedding API currently offers a free tier)
- Index persistence: Make sure the index is saved to disk so you don't re-embed on every restart
- Similarity threshold: Too high = missing relevant docs, too low = irrelevant results
Building the RAG Retrieval Chain
RAG Chain Architecture:
- RetrievalQA: Combines retriever + LLM into a question-answering pipeline
- Chain types: "stuff" (all docs in one prompt), "map_reduce" (process separately), "refine" (iterative)
- Streaming: Shows the response as it is generated for better UX
- Prompt engineering: Critical for preventing hallucinations
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain.callbacks import StreamingStdOutCallbackHandler
def create_qa_chain(self, streaming=True):
"""Create the complete RAG chain"""
# Initialize LLM
llm = ChatGoogleGenerativeAI(
model="gemini-2.0-flash",
temperature=0.3,
streaming=streaming,
callbacks=[StreamingStdOutCallbackHandler()] if streaming else []
)
# Create custom prompt template
prompt_template = """You are a helpful AI assistant. Use the following context to answer the question.
If you don't know the answer based on the context, say "I don't have enough information to answer this question."
Context:
{context}
Question: {question}
Answer: """
PROMPT = PromptTemplate(
template=prompt_template,
input_variables=["context", "question"]
)
# Create retrieval chain
qa_chain = RetrievalQA.from_chain_type(
llm=llm,
chain_type="stuff", # Use "map_reduce" for long contexts
retriever=self.retriever,
return_source_documents=True,
chain_type_kwargs={"prompt": PROMPT}
)
return qa_chain
def query(self, question, verbose=False):
"""Query the RAG system"""
if not self.retriever:
raise ValueError("Retriever not initialized")
# Create QA chain
qa_chain = self.create_qa_chain()
        # Execute query
        result = qa_chain.invoke({"query": question})
# Format response
response = {
"answer": result["result"],
"source_documents": []
}
# Add source information
for doc in result.get("source_documents", []):
source_info = {
"content": doc.page_content[:200] + "...",
"metadata": doc.metadata,
"similarity_score": getattr(doc, "score", None)
}
response["source_documents"].append(source_info)
if verbose:
print(f"\nRetrieved {len(response['source_documents'])} documents")
return response
LLM Configuration Explained:
- temperature=0.3: Lower temperature for more consistent, factual responses
- streaming=True: Shows tokens as they are generated (improves perceived latency)
- StreamingStdOutCallbackHandler: Prints tokens to the console in real time
- gemini-2.0-flash: Fast, cost-effective model well suited to RAG applications
Prompt Template Strategy:
- Clear instructions: Tells the LLM to use only the provided context
- Fallback response: Prevents hallucinations when the information isn't available
- {context} placeholder: Where retrieved chunks are inserted
- {question} placeholder: Where the user's query goes
- Customizable: Add role-playing, formatting instructions, etc.
Chain Type Selection Guide (a snippet for switching chain types follows this list):
- "stuff" - Best for: small contexts (under ~4k tokens). Limitation: can exceed the context window.
- "map_reduce" - Best for: large document sets. Limitation: may lose detail.
- "refine" - Best for: complex questions. Limitation: slower, more API calls.
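Switching chain types is just one argument to RetrievalQA.from_chain_type. A minimal sketch, assuming the llm and retriever objects created earlier (note that the custom "stuff" prompt above does not carry over to other chain types, which use different prompt variables):

# "stuff" packs every retrieved chunk into one prompt; "map_reduce" answers per chunk
# and then combines; "refine" improves the answer iteratively across chunks.
qa_map_reduce = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="map_reduce",      # or "stuff" / "refine"
    retriever=retriever,
    return_source_documents=True
)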
Complete Working Example
How to Use Your RAG System:
- Two modes: Build a new index from scratch or load an existing one
- Query workflow: user query → retrieve chunks → generate answer with context
- Advanced features: Metadata filtering for targeted searches
- Production tip: Always check whether an index exists before rebuilding
from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_chroma import Chroma
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain.callbacks import StreamingStdOutCallbackHandler
import os
from dotenv import load_dotenv
load_dotenv()
class RAGSystem:
def __init__(self, persist_directory="./rag_db"):
self.persist_directory = persist_directory
self.embeddings = GoogleGenerativeAIEmbeddings(
model="models/embedding-001"
)
self.vector_store = None
self.retriever = None
self.qa_chain = None
def load_documents(self, data_path):
"""Load documents from various sources"""
documents = []
# Load PDF files
pdf_loader = DirectoryLoader(
data_path,
glob="**/*.pdf",
loader_cls=PyPDFLoader
)
documents.extend(pdf_loader.load())
# Load text files
text_loader = DirectoryLoader(
data_path,
glob="**/*.txt",
loader_cls=TextLoader
)
documents.extend(text_loader.load())
return documents
def chunk_documents(self, documents, chunk_size=1000, chunk_overlap=200):
"""Split documents into chunks"""
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=chunk_size,
chunk_overlap=chunk_overlap,
length_function=len,
)
chunks = text_splitter.split_documents(documents)
return chunks
def create_vector_store(self, chunks):
"""Create vector store from document chunks"""
self.vector_store = Chroma.from_documents(
documents=chunks,
embedding=self.embeddings,
persist_directory=self.persist_directory
)
# Note: Chroma 0.4.x+ automatically persists documents
def load_vector_store(self):
"""Load existing vector store"""
self.vector_store = Chroma(
persist_directory=self.persist_directory,
embedding_function=self.embeddings
)
def setup_retriever(self, k=5, search_type="similarity"):
"""Setup retriever"""
if not self.vector_store:
raise ValueError("Vector store not created or loaded")
self.retriever = self.vector_store.as_retriever(
search_type=search_type,
search_kwargs={"k": k}
)
def create_qa_chain(self, streaming=True):
"""Create the complete RAG chain"""
# Initialize LLM
llm = ChatGoogleGenerativeAI(
model="gemini-2.0-flash",
temperature=0.1
)
# Create custom prompt
prompt_template = """Use the following context to answer the question.
If you don't know the answer, just say you don't know.
Context: {context}
Question: {question}
Answer:"""
PROMPT = PromptTemplate(
template=prompt_template,
input_variables=["context", "question"]
)
# Create QA chain
self.qa_chain = RetrievalQA.from_chain_type(
llm=llm,
chain_type="stuff",
retriever=self.retriever,
chain_type_kwargs={"prompt": PROMPT},
return_source_documents=True
)
def query(self, question, verbose=False):
"""Query the RAG system"""
if not self.qa_chain:
self.create_qa_chain()
response = self.qa_chain.invoke({"query": question})
if verbose:
print(f"Question: {question}")
print(f"Sources: {len(response['source_documents'])}")
return response
# Initialize RAG system - works with ChromaDB, Pinecone, Weaviate, Qdrant, or Milvus
rag = RAGSystem(persist_directory="./my_rag_db")
# Create sample documents for testing (replace with your own data)
import os
if not os.path.exists("./data"):
os.makedirs("./data")
# Create sample text file
with open("./data/sample.txt", "w", encoding="utf-8") as f:
f.write("""RAG (Retrieval Augmented Generation) Systems
RAG systems combine the power of retrieval and generation to provide accurate, contextual responses.
Key Benefits:
1. Reduces hallucinations in AI responses
2. Enables AI to access up-to-date information
3. Provides source attribution for answers
4. Allows domain-specific knowledge integration
How RAG Works:
- Documents are processed and stored in a vector database
- User queries are converted to embeddings
- Similar document chunks are retrieved
- Retrieved context is used to generate accurate responses
Popular vector databases for RAG include ChromaDB, Pinecone, Weaviate, and FAISS.""")
# Option 1: Build new RAG index from scratch (for beginners)
try:
documents = rag.load_documents("./data")
if documents:
print(f"Loaded {len(documents)} documents")
chunks = rag.chunk_documents(documents, chunk_size=1000, chunk_overlap=200)
print(f"Created {len(chunks)} chunks")
rag.create_vector_store(chunks) # Creates embeddings and stores in vector database
print("Vector store created successfully!")
# Setup retriever
rag.setup_retriever(k=5, search_type="similarity")
# How to query your RAG system - simple example
question = "What are the main benefits of using RAG systems?"
response = rag.query(question, verbose=True) # Returns answer + sources
print(f"\nAnswer: {response['result']}")
print(f"\nSources used: {len(response['source_documents'])}")
# Show source content
for i, doc in enumerate(response['source_documents']):
print(f"\nSource {i+1}: {doc.page_content[:200]}...")
else:
print("No documents found in ./data directory")
except Exception as e:
print(f"Error: {e}")
print("Make sure you have documents in ./data directory and GOOGLE_API_KEY is set")
# Advanced query with metadata filtering
class AdvancedRAG(RAGSystem):
def query_with_filter(self, question, filter_dict):
"""Query with metadata filtering"""
# Create filtered retriever
filtered_retriever = self.vector_store.as_retriever(
search_kwargs={
"k": 5,
"filter": filter_dict
}
)
# Use filtered retriever
original_retriever = self.retriever
self.retriever = filtered_retriever
result = self.query(question)
# Restore original retriever
self.retriever = original_retriever
return result
# Example with filtering - for searching specific documents
advanced_rag = AdvancedRAG() # Works with ChromaDB, Pinecone, Weaviate, etc.
advanced_rag.load_vector_store()
advanced_rag.setup_retriever()
filtered_response = advanced_rag.query_with_filter(
"What is the implementation process?",
filter_dict={"source": "implementation_guide.pdf"}
)
Code Walkthrough:
- RAGSystem(persist_directory=...): Creates the RAG instance; this directory stores the embeddings
- load_documents / chunk_documents / create_vector_store: Loads docs → chunks them → creates embeddings → stores them in the vector DB
- setup_retriever(k=5): Configures how many chunks to retrieve and which search strategy to use
- query(): Finds relevant chunks → passes them to the LLM → returns the answer plus sources
Advanced Filtering Explained:
- Metadata filtering: Search only within specific documents or categories
- filter_dict: Matches chunk metadata (e.g., source filename, page number)
- Use cases: Department-specific searches, date ranges, document types
- Performance: Filtering happens at the vector DB level (very fast)
Production Checklist:
- Check whether the vector store exists before rebuilding; this saves time and cost (see the sketch after this list)
- Implement error handling for missing documents
- Add logging for debugging retrieval issues
- Monitor embedding costs if using paid APIs
- Set up periodic index updates for new documents
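A minimal sketch of the first checklist item, reusing the RAGSystem class and paths from the example above:

import os

rag = RAGSystem(persist_directory="./my_rag_db")
if os.path.isdir("./my_rag_db") and os.listdir("./my_rag_db"):
    rag.load_vector_store()          # Reuse the persisted index; no re-embedding cost
else:
    docs = rag.load_documents("./data")
    chunks = rag.chunk_documents(docs)
    rag.create_vector_store(chunks)  # Build and persist the index once
rag.setup_retriever(k=5)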
Expected Output:
Loaded 15 documents
Created 127 chunks
Vector store created with 127 chunks
Retrieved 5 documents

Answer: Based on the context, RAG (Retrieval-Augmented Generation) systems offer several key benefits:
1. **Improved Accuracy**: By grounding responses in actual document content, RAG systems provide more accurate and factual answers compared to pure LLM generation.
2. **Reduced Hallucinations**: The retrieval component ensures that answers are based on real information, significantly reducing the likelihood of fabricated or incorrect responses.
3. **Dynamic Knowledge**: Unlike static LLMs, RAG systems can be updated with new documents without retraining, allowing for current and evolving knowledge bases.
4. **Source Attribution**: RAG systems can provide citations and source documents, enabling users to verify information and explore topics in more depth.
5. **Domain Specialization**: By using domain-specific documents, RAG systems can provide expert-level responses in specialized fields.

Sources used: 5
Chunking Documents for RAG - Best Practices
Choosing a Chunk Size
- Small (200-500 tokens): Better precision, more chunks
- Medium (500-1000 tokens): Balanced approach
- Large (1000-2000 tokens): More context, less precision
- Consider your model's context window
Overlap Strategies
- 10-20% overlap: Minimal redundancy
- 20-30% overlap: Good context preservation
- 30-50% overlap: Maximum context, more storage
- Adjust based on document structure (a small helper sketch follows this list)
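One convenient way to combine these two dials is to derive the overlap from the chunk size. A small sketch; the 20% default is just an example, not a requirement:

from langchain.text_splitter import RecursiveCharacterTextSplitter

def make_splitter(chunk_size=1000, overlap_ratio=0.2):
    """Build a splitter whose overlap is a fixed fraction of the chunk size."""
    return RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=int(chunk_size * overlap_ratio),
    )

small = make_splitter(chunk_size=400)    # higher precision, more chunks
large = make_splitter(chunk_size=1600)   # more context per chunk, fewer chunks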
Smart Chunking in Practice
Smart Chunking Principles:
- Document-aware: Different strategies for different file types
- Semantic preservation: Keep related information together
- Context windows: Balance chunk size against retrieval precision
- Overlap strategies: Ensure important information isn't split between chunks
from langchain.text_splitter import (
RecursiveCharacterTextSplitter,
CharacterTextSplitter,
TokenTextSplitter,
MarkdownHeaderTextSplitter
)
class SmartChunker:
"""Intelligent document chunking with multiple strategies"""
def __init__(self):
self.splitters = {
"recursive": RecursiveCharacterTextSplitter(
chunk_size=1000,
chunk_overlap=200,
separators=["\n\n", "\n", ". ", ", ", " ", ""]
),
"token": TokenTextSplitter(
chunk_size=500,
chunk_overlap=50
),
"semantic": self._create_semantic_splitter()
}
def _create_semantic_splitter(self):
"""Create splitter that preserves semantic boundaries"""
return RecursiveCharacterTextSplitter(
chunk_size=1000,
chunk_overlap=200,
separators=[
"\n## ", # Markdown headers
"\n### ",
"\n\n", # Paragraphs
"\n", # Lines
". ", # Sentences
"? ", # Questions
"! ", # Exclamations
"; ", # Semicolons
", ", # Commas
" ", # Words
"" # Characters
]
)
def chunk_by_document_type(self, document):
"""Choose chunking strategy based on document type"""
file_extension = document.metadata.get("source", "").split(".")[-1]
if file_extension == "md":
# Use markdown-aware splitting
headers_to_split_on = [
("#", "Header 1"),
("##", "Header 2"),
("###", "Header 3"),
]
markdown_splitter = MarkdownHeaderTextSplitter(
headers_to_split_on=headers_to_split_on
)
return markdown_splitter.split_text(document.page_content)
elif file_extension in ["py", "js", "java"]:
# Code files need different handling
return self._chunk_code(document)
else:
# Default recursive splitting
return self.splitters["recursive"].split_documents([document])
def _chunk_code(self, document):
"""Special handling for code files"""
# Split by functions/classes while preserving context
code_splitter = RecursiveCharacterTextSplitter(
chunk_size=1500,
chunk_overlap=300,
separators=[
"\nclass ",
"\ndef ",
"\n\n",
"\n",
" ",
""
]
)
return code_splitter.split_documents([document])
def experimental_sliding_window(self, text, window_size=500, step=250):
"""Sliding window approach for maximum context preservation"""
chunks = []
for i in range(0, len(text), step):
chunk = text[i:i + window_size]
if len(chunk) > 100: # Minimum chunk size
chunks.append({
"content": chunk,
"start": i,
"end": min(i + window_size, len(text))
})
return chunks
# Demo: Testing Different Chunking Strategies
print("=== Smart Chunking Demo ===")
# Create sample documents of different types
sample_documents = [
{
"content": """# Introduction to RAG Systems
RAG (Retrieval Augmented Generation) represents a significant advancement in AI systems. Unlike traditional language models that rely solely on their training data, RAG systems can access external knowledge bases to provide more accurate and up-to-date information.
## How RAG Works
The RAG process involves several key steps:
1. **Document Processing**: Raw documents are processed and split into manageable chunks
2. **Embedding Generation**: Each chunk is converted into vector embeddings
3. **Vector Storage**: Embeddings are stored in specialized vector databases
4. **Query Processing**: User queries are converted to embeddings
5. **Retrieval**: Similar chunks are retrieved based on semantic similarity
6. **Generation**: Retrieved context is used to generate accurate responses
## Benefits
RAG systems offer several advantages over traditional approaches:
- Reduced hallucinations in AI responses
- Access to up-to-date information
- Source attribution for answers
- Domain-specific knowledge integration""",
"type": "markdown",
"source": "rag_guide.md"
},
{
"content": """def create_vector_store(documents, embeddings):
'''Create a vector store from processed documents'''
chunks = []
for doc in documents:
# Split document into chunks
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=1000,
chunk_overlap=200
)
doc_chunks = text_splitter.split_documents([doc])
chunks.extend(doc_chunks)
# Create vector store
vector_store = Chroma.from_documents(
documents=chunks,
embedding=embeddings,
persist_directory="./vector_db"
)
return vector_store
class RAGPipeline:
def __init__(self, vector_store):
self.vector_store = vector_store
self.retriever = vector_store.as_retriever(search_kwargs={"k": 5})
def query(self, question):
retrieved_docs = self.retriever.get_relevant_documents(question)
return retrieved_docs""",
"type": "code",
"source": "rag_pipeline.py"
}
]
# Initialize chunker
chunker = SmartChunker()
for doc_info in sample_documents:
print(f"\n--- Processing {doc_info['source']} ({doc_info['type']}) ---")
# Create mock document object
class MockDocument:
def __init__(self, content, source):
self.page_content = content
self.metadata = {"source": source}
doc = MockDocument(doc_info["content"], doc_info["source"])
# Test different chunking strategies
print(f"Original length: {len(doc.page_content)} characters")
# Try recursive splitter
recursive_chunks = chunker.splitters["recursive"].split_documents([doc])
print(f"Recursive chunks: {len(recursive_chunks)}")
for i, chunk in enumerate(recursive_chunks[:2]): # Show first 2 chunks
print(f" Chunk {i+1}: {len(chunk.page_content)} chars - '{chunk.page_content[:80]}...'")
# Try document-type specific chunking
if doc_info["type"] == "markdown":
print("\nMarkdown-aware chunking:")
try:
md_chunks = chunker.chunk_by_document_type(doc)
print(f"Markdown chunks: {len(md_chunks)}")
for i, chunk in enumerate(md_chunks[:2]):
chunk_text = chunk.page_content if hasattr(chunk, 'page_content') else str(chunk)
print(f" MD Chunk {i+1}: {len(chunk_text)} chars")
except Exception as e:
print(f" Markdown chunking fallback to recursive")
# Test sliding window
print("\nSliding window chunking:")
sliding_chunks = chunker.experimental_sliding_window(doc.page_content, window_size=300, step=150)
print(f"Sliding window chunks: {len(sliding_chunks)}")
for i, chunk in enumerate(sliding_chunks[:3]):
print(f" Window {i+1}: chars {chunk['start']}-{chunk['end']} (length: {len(chunk['content'])})")
print("\n=== Chunking Strategy Comparison ===")
test_text = "This is a sample document. It contains multiple sentences. Each sentence provides different information. We want to test how different chunking strategies handle this content. The goal is to preserve semantic meaning while creating manageable chunks."
strategies = {
"recursive": chunker.splitters["recursive"],
"token": chunker.splitters["token"]
}
for strategy_name, splitter in strategies.items():
chunks = splitter.split_text(test_text)
print(f"\n{strategy_name.title()} strategy:")
print(f" Number of chunks: {len(chunks)}")
for i, chunk in enumerate(chunks):
print(f" Chunk {i+1}: '{chunk}'")
Splitter Types Explained:
- RecursiveCharacterTextSplitter: The most versatile - tries larger separators first (paragraphs) and falls back to smaller ones. Best for general text.
- TokenTextSplitter: Splits by token count, ensuring chunks fit within LLM token limits. Uses the tiktoken library.
- MarkdownHeaderTextSplitter: Preserves document structure by splitting at headers. Ideal for documentation.
Document-Specific Strategies:
- Markdown files: Split by headers to preserve section context
- Code files: Larger chunks (1500 chars) with more overlap to keep functions intact
- PDFs/Text: Standard recursive splitting with semantic boundaries
- CSV/Tables: Consider row-based or column-aware splitting (a row-based sketch follows this list)
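For the CSV case, a rough row-based sketch using LangChain's CSVLoader, which emits one Document per row (the grouping helper and the rows_per_chunk value are illustrative):

from langchain_community.document_loaders import CSVLoader

def chunk_csv_by_rows(csv_path, rows_per_chunk=20):
    """Group CSV rows into chunks so each chunk remains a self-contained slice of the table."""
    row_docs = CSVLoader(csv_path).load()          # one Document per row
    chunks = []
    for i in range(0, len(row_docs), rows_per_chunk):
        batch = row_docs[i:i + rows_per_chunk]
        chunks.append("\n".join(d.page_content for d in batch))
    return chunks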
Sliding Window Technique:
- window_size=500: Each chunk contains 500 characters
- step=250: Move forward by 250 chars (50% overlap)
- Benefits: Never loses context at chunk boundaries
- Trade-off: Creates more chunks, increases storage/search time
- Use case: Legal documents, contracts where every detail matters
Managing Context Windows in RAG Systems
Context Optimization
Context Window Management:
- Token counting: Critical for staying within LLM limits (GPT-4: 8k-128k depending on version, Claude: 100k)
- Reserved tokens: Always reserve space for the model's response
- Selection strategies: Choose which chunks to include based on different criteria
- Graceful truncation: If needed, truncate intelligently at sentence boundaries
Required Dependencies:
Before running this context optimization code, install the required packages:
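Based on the imports below, the only third-party package needed here is tiktoken (typing is part of the standard library):

pip install tiktoken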
from typing import List, Dict
import tiktoken
class ContextManager:
"""Learn how to manage context windows for RAG systems step-by-step"""
def __init__(self, model_name="gpt-3.5-turbo", max_tokens=4000):
self.encoding = tiktoken.encoding_for_model(model_name)
self.max_tokens = max_tokens
self.reserved_tokens = 500 # Reserve for response
def count_tokens(self, text: str) -> int:
"""Count tokens in text"""
return len(self.encoding.encode(text))
def optimize_context(self,
documents: List[Dict],
query: str,
strategy: str = "relevance") -> List[Dict]:
"""Optimize document selection for context window"""
query_tokens = self.count_tokens(query)
available_tokens = self.max_tokens - self.reserved_tokens - query_tokens
if strategy == "relevance":
return self._relevance_based_selection(documents, available_tokens)
elif strategy == "diversity":
return self._relevance_based_selection(documents, available_tokens) # Simplified for demo
elif strategy == "recency":
return self._relevance_based_selection(documents, available_tokens) # Simplified for demo
else:
return self._relevance_based_selection(documents, available_tokens) # Simplified for demo
def _relevance_based_selection(self, documents, available_tokens):
"""Select most relevant documents within token limit"""
selected = []
current_tokens = 0
# Sort by relevance score
sorted_docs = sorted(
documents,
key=lambda x: x.get("score", 0),
reverse=True
)
for doc in sorted_docs:
doc_tokens = self.count_tokens(doc["content"])
if current_tokens + doc_tokens <= available_tokens:
selected.append(doc)
current_tokens += doc_tokens
else:
# Try to fit partial document
remaining_tokens = available_tokens - current_tokens
if remaining_tokens > 100: # Minimum useful chunk
truncated_content = self._truncate_to_tokens(
doc["content"],
remaining_tokens
)
doc_copy = doc.copy()
doc_copy["content"] = truncated_content
doc_copy["truncated"] = True
selected.append(doc_copy)
break
return selected
def _truncate_to_tokens(self, text, max_tokens):
"""Truncate text to fit within token limit"""
tokens = self.encoding.encode(text)
if len(tokens) <= max_tokens:
return text
truncated_tokens = tokens[:max_tokens]
return self.encoding.decode(truncated_tokens)
def create_context_prompt(self, documents, query):
"""Create optimized prompt with context"""
context_parts = []
for i, doc in enumerate(documents):
source = doc.get("metadata", {}).get("source", "Unknown")
truncated = " (truncated)" if doc.get("truncated") else ""
context_parts.append(
f"[Document {i+1} - {source}{truncated}]\n"
f"{doc['content']}\n"
)
context = "\n---\n".join(context_parts)
return f"""Based on the following context, answer the question.
Context:
{context}
Question: {query}
Answer:"""
# Usage example - Context Optimization Demo
context_manager = ContextManager(max_tokens=4000)
# Simulate retrieved documents with realistic content
documents = [
{
"content": "RAG (Retrieval Augmented Generation) is a technique that combines information retrieval with text generation. It helps AI models access external knowledge to provide more accurate and up-to-date responses by retrieving relevant documents before generating answers.",
"score": 0.95,
"metadata": {"source": "rag_guide.pdf"}
},
{
"content": "Vector databases like ChromaDB, Pinecone, and Weaviate are essential components of RAG systems. They store document embeddings and enable fast similarity search to find relevant context for user queries.",
"score": 0.87,
"metadata": {"source": "vector_db_tutorial.txt"}
},
{
"content": "Traditional search relies on keyword matching, while RAG uses semantic similarity. This allows the system to understand meaning and context, making it more effective for question answering applications.",
"score": 0.82,
"metadata": {"source": "semantic_search.md"}
},
]
print("=== Context Optimization Demo ===")
print(f"Maximum tokens: {context_manager.max_tokens}")
print(f"Reserved for response: {context_manager.reserved_tokens}")
query = "What is RAG and how does it work?"
query_tokens = context_manager.count_tokens(query)
print(f"Query tokens: {query_tokens}")
# Test different strategies
for strategy in ["relevance", "diversity"]:
print(f"\n--- Strategy: {strategy} ---")
# Optimize context
optimized_docs = context_manager.optimize_context(
documents,
query,
strategy=strategy
)
print(f"Selected documents: {len(optimized_docs)}")
total_tokens = query_tokens + context_manager.reserved_tokens
for i, doc in enumerate(optimized_docs):
doc_tokens = context_manager.count_tokens(doc["content"])
total_tokens += doc_tokens
truncated = " (TRUNCATED)" if doc.get("truncated") else ""
print(f" Doc {i+1}: {doc_tokens} tokens{truncated}")
print(f"Total tokens used: {total_tokens}/{context_manager.max_tokens}")
# Create and show sample prompt
if strategy == "relevance": # Show prompt for one strategy
prompt = context_manager.create_context_prompt(optimized_docs, query)
print(f"\nSample prompt preview (first 200 chars):")
print(f"{prompt[:200]}...")
print("\n=== Token Counting Examples ===")
sample_texts = [
"Hello world",
"This is a longer sentence with more tokens to demonstrate counting.",
"RAG systems are amazing!"
]
for text in sample_texts:
token_count = context_manager.count_tokens(text)
print(f"Text: '{text}' β {token_count} tokens")
Key Components Explained:
- Token counting: Essential for accurate context management. Different models use different tokenizers; tiktoken handles OpenAI models.
- Reserved tokens: Always reserve 500-1000 tokens for the model's response. Running out of space mid-response is bad UX.
- Relevance selection: A greedy approach - add the highest-scoring docs until space runs out, then consider partial inclusion of the last doc.
Context Selection Strategies:
- Relevance: Prioritizes the highest-scoring chunks. Best for focused questions.
- Diversity: Includes varied perspectives. Good for comprehensive answers (a sketch follows this list).
- Recency: Prioritizes newer content. Ideal for time-sensitive queries.
- Hybrid: Combines multiple factors. Most robust for general use.
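The ContextManager above only implements the relevance strategy and falls back to it for the others. Below is a rough sketch of a diversity-oriented pass; the word-overlap measure and the 0.6 penalty weight are arbitrary choices for illustration:

def diversity_based_selection(documents, available_tokens, count_tokens, penalty=0.6):
    """Greedily pick chunks, down-weighting ones that overlap heavily with picks so far."""
    selected, used_words, used_tokens = [], set(), 0
    remaining = list(documents)
    while remaining:
        def adjusted_score(doc):
            words = set(doc["content"].lower().split())
            overlap = len(words & used_words) / max(len(words), 1)
            return doc.get("score", 0) - penalty * overlap
        best = max(remaining, key=adjusted_score)
        remaining.remove(best)
        tokens = count_tokens(best["content"])
        if used_tokens + tokens > available_tokens:
            continue  # Skip chunks that no longer fit, keep checking smaller ones
        selected.append(best)
        used_tokens += tokens
        used_words |= set(best["content"].lower().split())
    return selected

# Example: diversity_based_selection(documents, 3000, context_manager.count_tokens)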
Token Limit Considerations:
- GPT-3.5: 4k tokens (budget) or 16k tokens (extended)
- GPT-4: 8k, 32k, or 128k tokens depending on version
- Claude: 100k tokens (huge context window)
- Rule of thumb: 1 token ≈ 0.75 words in English
RAG Performance Optimization
Retrieval Optimization
Hybrid Search: Combines semantic (vector) + keyword (BM25) search for better coverage
# Hybrid search combining multiple strategies
from langchain.retrievers import (
    ContextualCompressionRetriever,
    EnsembleRetriever
)
from langchain_community.retrievers import BM25Retriever  # requires the rank_bm25 package
from langchain.retrievers.document_compressors import (
    LLMChainExtractor
)
# Create ensemble retriever
semantic_retriever = vector_store.as_retriever(
search_kwargs={"k": 10}
)
keyword_retriever = BM25Retriever.from_documents(
documents
)
keyword_retriever.k = 10
ensemble = EnsembleRetriever(
retrievers=[semantic_retriever, keyword_retriever],
weights=[0.6, 0.4]
)
# Add compression
compressor = LLMChainExtractor.from_llm(llm)
compressed_retriever = ContextualCompressionRetriever(
base_compressor=compressor,
base_retriever=ensemble
)
Benefits: Semantic search finds conceptually similar content, while BM25 catches exact keyword matches. Compression removes irrelevant parts.
Caching Strategy
Query Caching: Stores results for repeated questions to reduce latency and API costs
from functools import lru_cache
import hashlib
class CachedRAG:
def __init__(self, rag_system):
self.rag = rag_system
self.cache = {}
@lru_cache(maxsize=1000)
def _get_query_hash(self, query):
"""Create hash of query for caching"""
return hashlib.md5(
query.encode()
).hexdigest()
def query_with_cache(self, query):
"""Query with caching"""
query_hash = self._get_query_hash(query)
if query_hash in self.cache:
print("Cache hit!")
return self.cache[query_hash]
# Execute query
result = self.rag.query(query)
# Cache result
self.cache[query_hash] = result
return result
Benefits: Instant responses for common questions, and a substantial reduction in embedding/LLM API costs when queries repeat frequently.
Advanced Performance Techniques
Parallel Processing:
# Async retrieval for multiple queries
import asyncio

async def parallel_retrieve(queries):
    loop = asyncio.get_running_loop()
    # rag.query is synchronous, so run each call in the default thread pool
    tasks = [loop.run_in_executor(None, rag.query, q) for q in queries]
    results = await asyncio.gather(*tasks)
    return results
Batch Embeddings:
# Process documents in batches
def batch_embed_documents(docs, batch_size=100):
embeddings = []
for i in range(0, len(docs), batch_size):
batch = docs[i:i + batch_size]
batch_embeddings = embedding_model.embed_documents(
[d.page_content for d in batch]
)
embeddings.extend(batch_embeddings)
return embeddings
RAG Best Practices
Document Processing
- Clean and preprocess documents before chunking (a small sketch follows this list)
- Preserve document structure and metadata
- Test different chunk sizes for your use case
- Consider domain-specific chunking strategies
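A minimal cleaning pass you might run before chunking; the specific regexes here are illustrative, not exhaustive:

import re

def clean_document_text(text):
    """Light preprocessing before chunking: normalize whitespace and drop obvious noise."""
    text = text.replace("\x00", "")          # stray null bytes from PDF extraction
    text = re.sub(r"[ \t]+", " ", text)      # collapse runs of spaces and tabs
    text = re.sub(r"\n{3,}", "\n\n", text)   # collapse excessive blank lines
    return text.strip()

# Apply to each LangChain document before passing it to the text splitter:
# for doc in documents:
#     doc.page_content = clean_document_text(doc.page_content)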
Retrieval Quality
- Use hybrid search for better coverage
- Implement relevance feedback loops
- Monitor and optimize retrieval metrics
- Consider query expansion techniques (a simple sketch follows this list)
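One lightweight form of query expansion is to let the LLM rephrase the question and merge the retrieval results. A sketch assuming the llm and retriever objects created earlier in this tutorial; the prompt wording and the deduplication-by-content step are just one option:

def expanded_retrieve(question, retriever, llm, n_variants=2):
    """Retrieve with the original question plus LLM-generated rephrasings, deduplicated."""
    prompt = f"Rewrite the following question {n_variants} different ways, one per line:\n{question}"
    rewrites = [line.strip() for line in llm.invoke(prompt).content.splitlines() if line.strip()]
    queries = [question] + rewrites[:n_variants]
    seen, merged = set(), []
    for q in queries:
        for doc in retriever.invoke(q):
            if doc.page_content not in seen:
                seen.add(doc.page_content)
                merged.append(doc)
    return merged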
Common RAG Challenges
Lost Context Problem
Issue: Important information split across chunks
Solution: Increase chunk overlap and use sliding windows
Retrieval Precision
Issue: Retrieved chunks are not always the most relevant
Solution: Implement reranking and query expansion (a cross-encoder reranking sketch follows)
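A reranking pass can be as simple as rescoring each retrieved chunk against the query with a cross-encoder. This sketch assumes the sentence-transformers package is installed and that docs are LangChain Documents; the model name is one common choice, not a requirement:

from sentence_transformers import CrossEncoder

def rerank(question, docs, top_n=5, model_name="cross-encoder/ms-marco-MiniLM-L-6-v2"):
    """Re-order retrieved chunks by cross-encoder relevance and keep the best top_n."""
    scorer = CrossEncoder(model_name)
    scores = scorer.predict([(question, d.page_content) for d in docs])
    ranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]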
Hallucination Risk
Issue: LLM generates information not in context
Solution: Use strict prompts and fact-checking mechanisms
Next Steps
Congratulations on building your first complete RAG system! You've mastered the fundamentals of document processing, retrieval, and generation. Next, you'll explore advanced RAG techniques including query expansion, reranking, and multi-modal RAG.