Phase 3 · Advanced · ⏱ 120 minutes

Building Your First RAG System

Learn how to build a retrieval-augmented generation (RAG) system from scratch with LangChain, using a vector store such as ChromaDB, Pinecone, or FAISS.

🎯

What You'll Learn in This RAG Tutorial

  • Understand how to build RAG systems with LangChain
  • Master document chunking strategies for optimal retrieval
  • Build RAG retrieval chains from scratch with ChromaDB, Pinecone, or Weaviate
  • Optimize context windows for better RAG performance

πŸ—οΈ

How RAG Works - Step-by-Step Architecture Guide

πŸ” What is RAG?

Learn how to build RAG (Retrieval-Augmented Generation) systems that combine LangChain with vector databases like ChromaDB, Pinecone, Weaviate, Qdrant, or Milvus. This beginner-friendly tutorial shows you step-by-step how to retrieve relevant information from your documents and generate accurate responses.

Step-by-Step RAG Pipeline Components with LangChain

1. Document Ingestion: Load and preprocess documents for the LangChain RAG pipeline.
2. Text Chunking: Split documents into chunks sized for effective retrieval.
3. Embeddings and Vector Storage: Embed each chunk and store it in ChromaDB, Pinecone, Weaviate, Qdrant, or Milvus.
4. Retrieval: Find the chunks most relevant to a query using LangChain and vector search.
5. Response Generation: Generate grounded answers with a LangChain RAG chain.

The sketch below shows how these five steps fit together in code; the rest of this tutorial builds each one out in detail.
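
To make the flow concrete, here is a minimal end-to-end sketch using the same LangChain components covered below (it assumes a GOOGLE_API_KEY in your environment and a ./data folder of .txt files; every piece is explained in depth later in the tutorial):

from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_google_genai import GoogleGenerativeAIEmbeddings, ChatGoogleGenerativeAI
from langchain_chroma import Chroma
from langchain.chains import RetrievalQA

docs = DirectoryLoader("./data", glob="**/*.txt", loader_cls=TextLoader).load()       # 1. Ingestion
chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200).split_documents(docs)  # 2. Chunking
store = Chroma.from_documents(chunks, GoogleGenerativeAIEmbeddings(model="models/embedding-001"))  # 3. Embed + store
retriever = store.as_retriever(search_kwargs={"k": 5})                                # 4. Retrieval
qa = RetrievalQA.from_chain_type(llm=ChatGoogleGenerativeAI(model="gemini-2.0-flash"), retriever=retriever)
print(qa.invoke({"query": "What do these documents cover?"})["result"])               # 5. Generation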

💻

Build Your First RAG System - Complete Tutorial

How to Load Documents for RAG - Step-by-Step Guide

πŸ” Understanding the RAG System Class:

  • β€’ persist_directory: Where your vector database will be stored on disk
  • β€’ GoogleGenerativeAIEmbeddings: Converts text into numerical vectors for semantic search
  • β€’ vector_store: Database that stores document chunks and their embeddings
  • β€’ retriever: Component that finds relevant chunks based on queries

📦 Required Dependencies:

Before running this RAG implementation, install the required packages:

pip install langchain langchain-community langchain-google-genai langchain-chroma python-dotenv

from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_chroma import Chroma  # Can also use Pinecone, Weaviate, Qdrant, Milvus, or FAISS
import os

class RAGSystem:
    def __init__(self, persist_directory="./rag_db"):
        self.persist_directory = persist_directory
        self.embeddings = GoogleGenerativeAIEmbeddings(
            model="models/embedding-001"
        )
        self.vector_store = None
        self.retriever = None
        
    def load_documents(self, data_path):
        """Load documents from various sources"""
        documents = []
        
        # Load PDFs
        pdf_loader = DirectoryLoader(
            data_path,
            glob="**/*.pdf",
            loader_cls=PyPDFLoader,
            show_progress=True
        )
        documents.extend(pdf_loader.load())
        
        # Load text files
        text_loader = DirectoryLoader(
            data_path,
            glob="**/*.txt",
            loader_cls=TextLoader,
            show_progress=True
        )
        documents.extend(text_loader.load())
        
        print(f"Loaded {len(documents)} documents")
        return documents
    
    def chunk_documents(self, documents, chunk_size=1000, chunk_overlap=200):
        """Split documents into optimal chunks"""
        # Create text splitter with semantic awareness
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap,
            length_function=len,
            separators=[
                "\n\n",    # Paragraph breaks
                "\n",       # Line breaks
                ". ",       # Sentence ends
                ", ",       # Clause breaks
                " ",        # Word breaks
                ""          # Character breaks
            ]
        )
        
        # Split documents
        chunks = text_splitter.split_documents(documents)
        
        # Add metadata to chunks
        for i, chunk in enumerate(chunks):
            chunk.metadata["chunk_id"] = i
            chunk.metadata["chunk_size"] = len(chunk.page_content)
            
        print(f"Created {len(chunks)} chunks")
        return chunks

💡 Document Loading Explained:

  • DirectoryLoader: Recursively loads all files matching the glob pattern from a directory
  • glob="**/*.pdf": Finds all PDF files in any subdirectory (** means any depth)
  • PyPDFLoader: Extracts text from PDF files, preserving page information
  • show_progress=True: Displays a progress bar during loading

βœ‚οΈ Chunking Strategy Deep Dive:

  • β€’ RecursiveCharacterTextSplitter: Tries to split at natural boundaries (paragraphs first, then sentences)
  • β€’ chunk_size=1000: Each chunk contains ~1000 characters (optimal for most LLMs)
  • β€’ chunk_overlap=200: 200 characters repeated between chunks to preserve context
  • β€’ Separator hierarchy: Splits at paragraphs first, only using smaller breaks if needed
  • β€’ Metadata enrichment: Adds chunk_id and size for tracking and debugging

⚠️ Performance Tips:

For large document collections (>1000 files), consider:

  • β€’ Using async document loading for better performance
  • β€’ Processing documents in batches to manage memory
  • β€’ Adding file type filtering to avoid unsupported formats

How to Create Vector Stores with ChromaDB, Pinecone, or Weaviate

πŸ” Vector Store Concepts:

  • β€’ Embeddings: Converts text chunks into high-dimensional vectors (typically 768-1536 dimensions)
  • β€’ Vector Databases: ChromaDB (open-source), Pinecone (cloud), Weaviate (hybrid), Qdrant (performance), Milvus (scalable)
  • β€’ HNSW (Hierarchical Navigable Small World): Algorithm for approximate nearest neighbor search
  • β€’ Cosine similarity: Measures angle between vectors (better for text than Euclidean distance)
    def create_vector_store(self, chunks):
        """Create and persist vector store"""
        # Create vector store from chunks
        # Can also use Pinecone, Weaviate, Qdrant, Milvus, or FAISS
        self.vector_store = Chroma.from_documents(
            documents=chunks,
            embedding=self.embeddings,
            persist_directory=self.persist_directory,
            collection_metadata={"hnsw:space": "cosine"}
        )
        
        # Chroma 0.4.x+ automatically persists documents to persist_directory
        print(f"Vector store created with {len(chunks)} chunks")
        
    def load_vector_store(self):
        """Load existing vector store"""
        self.vector_store = Chroma(
            persist_directory=self.persist_directory,
            embedding_function=self.embeddings
        )
        print("Vector store loaded from disk")
    
    def setup_retriever(self, k=5, search_type="similarity"):
        """Configure retriever with advanced options"""
        if not self.vector_store:
            raise ValueError("Vector store not initialized")
        
        # Build search parameters that match the chosen search type:
        # plain "similarity" only needs k, "similarity_score_threshold" adds a
        # minimum score, and "mmr" fetches extra candidates to pick diverse ones.
        search_kwargs = {"k": k}
        if search_type == "similarity_score_threshold":
            search_kwargs["score_threshold"] = 0.7  # Minimum similarity score
        elif search_type == "mmr":
            search_kwargs["fetch_k"] = k * 2        # Fetch more candidates for MMR
        
        self.retriever = self.vector_store.as_retriever(
            search_type=search_type,
            search_kwargs=search_kwargs
        )
        
        return self.retriever

💡 Vector Store Creation Explained:

  • from_documents(): Automatically generates embeddings for all chunks and stores them
  • persist_directory: Saves the database to disk so you don't need to re-embed documents
  • collection_metadata: Configures the search algorithm (HNSW with cosine similarity)
  • Persistence: Chroma 0.4.x+ writes the index to disk automatically; older versions need an explicit persist() call

🎯 Retriever Configuration Deep Dive:

  • β€’ k=5: Returns top 5 most similar chunks (adjust based on context window)
  • β€’ search_type="similarity": Pure semantic search (alternatives: "mmr" for diversity)
  • β€’ score_threshold=0.7: Only returns chunks with 70%+ similarity (filters out weak matches)
  • β€’ fetch_k=k*2: For MMR, fetches 10 candidates then selects 5 diverse ones

⚠️ Common Pitfalls:

  • β€’ Memory usage: Large collections can consume significant RAM during embedding
  • β€’ Embedding costs: Some embedding models charge per token (Google's is free)
  • β€’ Index persistence: Always persist to avoid re-embedding on restart
  • β€’ Similarity threshold: Too high = missing relevant docs, too low = irrelevant results

Build RAG Retrieval Chains - Beginner Tutorial

πŸ” RAG Chain Architecture:

  • β€’ RetrievalQA: Combines retriever + LLM into a question-answering pipeline
  • β€’ Chain Types: "stuff" (all docs in one prompt), "map_reduce" (process separately), "refine" (iterative)
  • β€’ Streaming: Shows response as it's generated for better UX
  • β€’ Prompt Engineering: Critical for preventing hallucinations
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain.callbacks import StreamingStdOutCallbackHandler

    def create_qa_chain(self, streaming=True):
        """Create the complete RAG chain"""
        # Initialize LLM
        llm = ChatGoogleGenerativeAI(
            model="gemini-2.0-flash",
            temperature=0.3,
            streaming=streaming,
            callbacks=[StreamingStdOutCallbackHandler()] if streaming else []
        )
        
        # Create custom prompt template
        prompt_template = """You are a helpful AI assistant. Use the following context to answer the question.
If you don't know the answer based on the context, say "I don't have enough information to answer this question."

Context:
{context}

Question: {question}

Answer: """
        
        PROMPT = PromptTemplate(
            template=prompt_template,
            input_variables=["context", "question"]
        )
        
        # Create retrieval chain
        qa_chain = RetrievalQA.from_chain_type(
            llm=llm,
            chain_type="stuff",  # Use "map_reduce" for long contexts
            retriever=self.retriever,
            return_source_documents=True,
            chain_type_kwargs={"prompt": PROMPT}
        )
        
        return qa_chain
    
    def query(self, question, verbose=False):
        """Query the RAG system"""
        if not self.retriever:
            raise ValueError("Retriever not initialized")
        
        # Create QA chain
        qa_chain = self.create_qa_chain()
        
        # Execute query
        result = qa_chain.invoke({"query": question})
        
        # Format response
        response = {
            "answer": result["result"],
            "source_documents": []
        }
        
        # Add source information
        for doc in result.get("source_documents", []):
            source_info = {
                "content": doc.page_content[:200] + "...",
                "metadata": doc.metadata,
                "similarity_score": getattr(doc, "score", None)
            }
            response["source_documents"].append(source_info)
        
        if verbose:
            print(f"\nRetrieved {len(response['source_documents'])} documents")
            
        return response

💡 LLM Configuration Explained:

  • temperature=0.3: Lower temperature for more consistent, factual responses
  • streaming=True: Shows tokens as generated (improves perceived latency)
  • StreamingStdOutCallbackHandler: Prints tokens to console in real-time
  • gemini-2.0-flash: Fast, cost-effective model ideal for RAG applications

πŸ“ Prompt Template Strategy:

  • β€’ Clear instructions: Tells LLM to only use provided context
  • β€’ Fallback response: Prevents hallucinations when info isn't available
  • β€’ {context} placeholder: Where retrieved chunks are inserted
  • β€’ {question} placeholder: User's query goes here
  • β€’ Customizable: Add role-playing, formatting instructions, etc.

🔄 Chain Type Selection Guide:

"stuff": Puts all docs in one prompt
  • ✅ Best for: Small contexts (<4k tokens)
  • ❌ Limitation: Can exceed context window
"map_reduce": Summarizes each doc, then combines
  • ✅ Best for: Large document sets
  • ❌ Limitation: May lose detail
"refine": Iteratively builds answer
  • ✅ Best for: Complex questions
  • ❌ Limitation: Slower, more API calls

A short sketch of switching between these chain types follows.

Complete RAG Tutorial - Beginner's Code Example

πŸ” How to Use Your RAG System - Step by Step:

  • β€’ Two modes: Build new index from scratch or load existing one
  • β€’ Query workflow: User query β†’ Retrieve chunks β†’ Generate answer with context
  • β€’ Advanced features: Metadata filtering for targeted searches
  • β€’ Production tips: Always check if index exists before rebuilding
from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_chroma import Chroma
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain.callbacks import StreamingStdOutCallbackHandler
import os
from dotenv import load_dotenv

load_dotenv()

class RAGSystem:
    def __init__(self, persist_directory="./rag_db"):
        self.persist_directory = persist_directory
        self.embeddings = GoogleGenerativeAIEmbeddings(
            model="models/embedding-001"
        )
        self.vector_store = None
        self.retriever = None
        self.qa_chain = None
        
    def load_documents(self, data_path):
        """Load documents from various sources"""
        documents = []
        
        # Load PDF files
        pdf_loader = DirectoryLoader(
            data_path,
            glob="**/*.pdf",
            loader_cls=PyPDFLoader
        )
        documents.extend(pdf_loader.load())
        
        # Load text files
        text_loader = DirectoryLoader(
            data_path,
            glob="**/*.txt",
            loader_cls=TextLoader
        )
        documents.extend(text_loader.load())
        
        return documents
    
    def chunk_documents(self, documents, chunk_size=1000, chunk_overlap=200):
        """Split documents into chunks"""
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap,
            length_function=len,
        )
        chunks = text_splitter.split_documents(documents)
        return chunks
    
    def create_vector_store(self, chunks):
        """Create vector store from document chunks"""
        self.vector_store = Chroma.from_documents(
            documents=chunks,
            embedding=self.embeddings,
            persist_directory=self.persist_directory
        )
        # Note: Chroma 0.4.x+ automatically persists documents
        
    def load_vector_store(self):
        """Load existing vector store"""
        self.vector_store = Chroma(
            persist_directory=self.persist_directory,
            embedding_function=self.embeddings
        )
    
    def setup_retriever(self, k=5, search_type="similarity"):
        """Setup retriever"""
        if not self.vector_store:
            raise ValueError("Vector store not created or loaded")
            
        self.retriever = self.vector_store.as_retriever(
            search_type=search_type,
            search_kwargs={"k": k}
        )
    
    def create_qa_chain(self, streaming=True):
        """Create the complete RAG chain"""
        # Initialize LLM
        llm = ChatGoogleGenerativeAI(
            model="gemini-2.0-flash",
            temperature=0.1
        )
        
        # Create custom prompt
        prompt_template = """Use the following context to answer the question. 
        If you don't know the answer, just say you don't know.

        Context: {context}

        Question: {question}

        Answer:"""
        
        PROMPT = PromptTemplate(
            template=prompt_template,
            input_variables=["context", "question"]
        )
        
        # Create QA chain
        self.qa_chain = RetrievalQA.from_chain_type(
            llm=llm,
            chain_type="stuff",
            retriever=self.retriever,
            chain_type_kwargs={"prompt": PROMPT},
            return_source_documents=True
        )
    
    def query(self, question, verbose=False):
        """Query the RAG system"""
        if not self.qa_chain:
            self.create_qa_chain()
        
        response = self.qa_chain.invoke({"query": question})
        
        if verbose:
            print(f"Question: {question}")
            print(f"Sources: {len(response['source_documents'])}")
        
        return response

# Initialize RAG system - works with ChromaDB, Pinecone, Weaviate, Qdrant, or Milvus
rag = RAGSystem(persist_directory="./my_rag_db")

# Create sample documents for testing (replace with your own data)
import os
if not os.path.exists("./data"):
    os.makedirs("./data")
    
    # Create sample text file
    with open("./data/sample.txt", "w", encoding="utf-8") as f:
        f.write("""RAG (Retrieval Augmented Generation) Systems

RAG systems combine the power of retrieval and generation to provide accurate, contextual responses.

Key Benefits:
1. Reduces hallucinations in AI responses
2. Enables AI to access up-to-date information
3. Provides source attribution for answers
4. Allows domain-specific knowledge integration

How RAG Works:
- Documents are processed and stored in a vector database
- User queries are converted to embeddings
- Similar document chunks are retrieved
- Retrieved context is used to generate accurate responses

Popular vector databases for RAG include ChromaDB, Pinecone, Weaviate, and FAISS.""")

# Option 1: Build new RAG index from scratch (for beginners)
try:
    documents = rag.load_documents("./data")
    if documents:
        print(f"Loaded {len(documents)} documents")
        chunks = rag.chunk_documents(documents, chunk_size=1000, chunk_overlap=200)
        print(f"Created {len(chunks)} chunks")
        rag.create_vector_store(chunks)  # Creates embeddings and stores in vector database
        print("Vector store created successfully!")
        
        # Setup retriever
        rag.setup_retriever(k=5, search_type="similarity")
        
        # How to query your RAG system - simple example
        question = "What are the main benefits of using RAG systems?"
        response = rag.query(question, verbose=True)  # Returns answer + sources
        
        print(f"\nAnswer: {response['result']}")
        print(f"\nSources used: {len(response['source_documents'])}")
        
        # Show source content
        for i, doc in enumerate(response['source_documents']):
            print(f"\nSource {i+1}: {doc.page_content[:200]}...")
    else:
        print("No documents found in ./data directory")
        
except Exception as e:
    print(f"Error: {e}")
    print("Make sure you have documents in ./data directory and GOOGLE_API_KEY is set")

# Advanced query with metadata filtering
class AdvancedRAG(RAGSystem):
    def query_with_filter(self, question, filter_dict):
        """Query with metadata filtering"""
        # Create filtered retriever
        filtered_retriever = self.vector_store.as_retriever(
            search_kwargs={
                "k": 5,
                "filter": filter_dict
            }
        )
        
        # Use the filtered retriever and force the QA chain to be rebuilt with it
        original_retriever = self.retriever
        self.retriever = filtered_retriever
        self.qa_chain = None
        
        result = self.query(question)
        
        # Restore the original retriever and reset the chain again
        self.retriever = original_retriever
        self.qa_chain = None
        
        return result

# Example with filtering - for searching specific documents
advanced_rag = AdvancedRAG(persist_directory="./my_rag_db")  # Reuses the persisted store built above (works with ChromaDB, Pinecone, Weaviate, etc.)
advanced_rag.load_vector_store()
advanced_rag.setup_retriever()

filtered_response = advanced_rag.query_with_filter(
    "What is the implementation process?",
    filter_dict={"source": "implementation_guide.pdf"}
)

🎯 Code Walkthrough:

1. Initialization:

Creates the RAG instance with a database path; this directory is where the embeddings are stored.

2. Document Processing:

Loads docs → chunks them → creates embeddings → stores them in the vector DB.

3. Retriever Setup:

Configures how many chunks to retrieve (k=5) and the search strategy.

4. Query Execution:

Finds relevant chunks → passes them to the LLM → returns the answer plus sources.

🔧 Advanced Filtering Explained:

  • Metadata filtering: Search only within specific documents or categories
  • filter_dict: Matches chunk metadata (e.g., source filename, page number)
  • Use cases: Department-specific searches, date ranges, document types
  • Performance: Filtering happens at the vector DB level (very fast)

⚠️ Production Checklist:

  • βœ“ Check if vector store exists before rebuilding (saves time/cost)
  • βœ“ Implement error handling for missing documents
  • βœ“ Add logging for debugging retrieval issues
  • βœ“ Monitor embedding costs if using paid APIs
  • βœ“ Set up periodic index updates for new documents

💡 Expected Output:

Loaded 15 documents
Created 127 chunks
Vector store created with 127 chunks

Retrieved 5 documents

Answer: Based on the context, RAG (Retrieval-Augmented Generation) systems offer several key benefits:

1. **Improved Accuracy**: By grounding responses in actual document content, RAG systems provide more accurate and factual answers compared to pure LLM generation.

2. **Reduced Hallucinations**: The retrieval component ensures that answers are based on real information, significantly reducing the likelihood of fabricated or incorrect responses.

3. **Dynamic Knowledge**: Unlike static LLMs, RAG systems can be updated with new documents without retraining, allowing for current and evolving knowledge bases.

4. **Source Attribution**: RAG systems can provide citations and source documents, enabling users to verify information and explore topics in more depth.

5. **Domain Specialization**: By using domain-specific documents, RAG systems can provide expert-level responses in specialized fields.

Sources used: 5
✂️

How to Chunk Documents for RAG - Best Practices Guide

πŸ“ How to Choose Chunk Size for Beginners

  • β€’ Small (200-500 tokens): Better precision, more chunks
  • β€’ Medium (500-1000 tokens): Balanced approach
  • β€’ Large (1000-2000 tokens): More context, less precision
  • β€’ Consider your model's context window

🔄 Overlap Strategies

  • 10-20% overlap: Minimal redundancy
  • 20-30% overlap: Good context preservation
  • 30-50% overlap: Maximum context, more storage
  • Adjust based on document structure (a short comparison sketch follows this list)

Step-by-Step Chunking Tutorial for RAG Applications

πŸ” Smart Chunking Principles:

  • β€’ Document-aware: Different strategies for different file types
  • β€’ Semantic preservation: Keep related information together
  • β€’ Context windows: Balance between chunk size and retrieval precision
  • β€’ Overlap strategies: Ensure important info isn't split between chunks
from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    CharacterTextSplitter,
    TokenTextSplitter,
    MarkdownHeaderTextSplitter
)

class SmartChunker:
    """Intelligent document chunking with multiple strategies"""
    
    def __init__(self):
        self.splitters = {
            "recursive": RecursiveCharacterTextSplitter(
                chunk_size=1000,
                chunk_overlap=200,
                separators=["\n\n", "\n", ". ", ", ", " ", ""]
            ),
            "token": TokenTextSplitter(
                chunk_size=500,
                chunk_overlap=50
            ),
            "semantic": self._create_semantic_splitter()
        }
    
    def _create_semantic_splitter(self):
        """Create splitter that preserves semantic boundaries"""
        return RecursiveCharacterTextSplitter(
            chunk_size=1000,
            chunk_overlap=200,
            separators=[
                "\n## ",     # Markdown headers
                "\n### ",    
                "\n\n",     # Paragraphs
                "\n",        # Lines
                ". ",         # Sentences
                "? ",         # Questions
                "! ",         # Exclamations
                "; ",         # Semicolons
                ", ",         # Commas
                " ",          # Words
                ""            # Characters
            ]
        )
    
    def chunk_by_document_type(self, document):
        """Choose chunking strategy based on document type"""
        file_extension = document.metadata.get("source", "").split(".")[-1]
        
        if file_extension == "md":
            # Use markdown-aware splitting
            headers_to_split_on = [
                ("#", "Header 1"),
                ("##", "Header 2"),
                ("###", "Header 3"),
            ]
            markdown_splitter = MarkdownHeaderTextSplitter(
                headers_to_split_on=headers_to_split_on
            )
            return markdown_splitter.split_text(document.page_content)
        
        elif file_extension in ["py", "js", "java"]:
            # Code files need different handling
            return self._chunk_code(document)
        
        else:
            # Default recursive splitting
            return self.splitters["recursive"].split_documents([document])
    
    def _chunk_code(self, document):
        """Special handling for code files"""
        # Split by functions/classes while preserving context
        code_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1500,
            chunk_overlap=300,
            separators=[
                "\nclass ",
                "\ndef ",
                "\n\n",
                "\n",
                " ",
                ""
            ]
        )
        return code_splitter.split_documents([document])
    
    def experimental_sliding_window(self, text, window_size=500, step=250):
        """Sliding window approach for maximum context preservation"""
        chunks = []
        for i in range(0, len(text), step):
            chunk = text[i:i + window_size]
            if len(chunk) > 100:  # Minimum chunk size
                chunks.append({
                    "content": chunk,
                    "start": i,
                    "end": min(i + window_size, len(text))
                })
        return chunks

# Demo: Testing Different Chunking Strategies
print("=== Smart Chunking Demo ===")

# Create sample documents of different types
sample_documents = [
    {
        "content": """# Introduction to RAG Systems

RAG (Retrieval Augmented Generation) represents a significant advancement in AI systems. Unlike traditional language models that rely solely on their training data, RAG systems can access external knowledge bases to provide more accurate and up-to-date information.

## How RAG Works

The RAG process involves several key steps:

1. **Document Processing**: Raw documents are processed and split into manageable chunks
2. **Embedding Generation**: Each chunk is converted into vector embeddings
3. **Vector Storage**: Embeddings are stored in specialized vector databases
4. **Query Processing**: User queries are converted to embeddings
5. **Retrieval**: Similar chunks are retrieved based on semantic similarity
6. **Generation**: Retrieved context is used to generate accurate responses

## Benefits

RAG systems offer several advantages over traditional approaches:
- Reduced hallucinations in AI responses
- Access to up-to-date information
- Source attribution for answers
- Domain-specific knowledge integration""",
        "type": "markdown",
        "source": "rag_guide.md"
    },
    {
        "content": """def create_vector_store(documents, embeddings):
    '''Create a vector store from processed documents'''
    chunks = []
    
    for doc in documents:
        # Split document into chunks
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,
            chunk_overlap=200
        )
        doc_chunks = text_splitter.split_documents([doc])
        chunks.extend(doc_chunks)
    
    # Create vector store
    vector_store = Chroma.from_documents(
        documents=chunks,
        embedding=embeddings,
        persist_directory="./vector_db"
    )
    
    return vector_store

class RAGPipeline:
    def __init__(self, vector_store):
        self.vector_store = vector_store
        self.retriever = vector_store.as_retriever(search_kwargs={"k": 5})
    
    def query(self, question):
        retrieved_docs = self.retriever.get_relevant_documents(question)
        return retrieved_docs""",
        "type": "code",
        "source": "rag_pipeline.py"
    }
]

# Initialize chunker
chunker = SmartChunker()

for doc_info in sample_documents:
    print(f"\n--- Processing {doc_info['source']} ({doc_info['type']}) ---")
    
    # Create mock document object
    class MockDocument:
        def __init__(self, content, source):
            self.page_content = content
            self.metadata = {"source": source}
    
    doc = MockDocument(doc_info["content"], doc_info["source"])
    
    # Test different chunking strategies
    print(f"Original length: {len(doc.page_content)} characters")
    
    # Try recursive splitter
    recursive_chunks = chunker.splitters["recursive"].split_documents([doc])
    print(f"Recursive chunks: {len(recursive_chunks)}")
    for i, chunk in enumerate(recursive_chunks[:2]):  # Show first 2 chunks
        print(f"  Chunk {i+1}: {len(chunk.page_content)} chars - '{chunk.page_content[:80]}...'")
    
    # Try document-type specific chunking
    if doc_info["type"] == "markdown":
        print("\nMarkdown-aware chunking:")
        try:
            md_chunks = chunker.chunk_by_document_type(doc)
            print(f"Markdown chunks: {len(md_chunks)}")
            for i, chunk in enumerate(md_chunks[:2]):
                chunk_text = chunk.page_content if hasattr(chunk, 'page_content') else str(chunk)
                print(f"  MD Chunk {i+1}: {len(chunk_text)} chars")
        except Exception as e:
            print(f"  Markdown chunking failed ({e}); falling back to recursive")
    
    # Test sliding window
    print("\nSliding window chunking:")
    sliding_chunks = chunker.experimental_sliding_window(doc.page_content, window_size=300, step=150)
    print(f"Sliding window chunks: {len(sliding_chunks)}")
    for i, chunk in enumerate(sliding_chunks[:3]):
        print(f"  Window {i+1}: chars {chunk['start']}-{chunk['end']} (length: {len(chunk['content'])})")

print("\n=== Chunking Strategy Comparison ===")
test_text = "This is a sample document. It contains multiple sentences. Each sentence provides different information. We want to test how different chunking strategies handle this content. The goal is to preserve semantic meaning while creating manageable chunks."

strategies = {
    "recursive": chunker.splitters["recursive"],
    "token": chunker.splitters["token"]
}

for strategy_name, splitter in strategies.items():
    chunks = splitter.split_text(test_text)
    print(f"\n{strategy_name.title()} strategy:")
    print(f"  Number of chunks: {len(chunks)}")
    for i, chunk in enumerate(chunks):
        print(f"  Chunk {i+1}: '{chunk}'")

💡 Splitter Types Explained:

RecursiveCharacterTextSplitter:

Most versatile - tries larger separators first (paragraphs) then falls back to smaller ones. Best for general text.

TokenTextSplitter:

Splits by token count - ensures chunks fit within LLM token limits. Uses tiktoken library.

MarkdownHeaderTextSplitter:

Preserves document structure by splitting at headers. Perfect for documentation.

🎯 Document-Specific Strategies:

  • β€’ Markdown files: Split by headers to preserve section context
  • β€’ Code files: Larger chunks (1500 chars) with more overlap to keep functions intact
  • β€’ PDFs/Text: Standard recursive splitting with semantic boundaries
  • β€’ CSV/Tables: Consider row-based or column-aware splitting

🔄 Sliding Window Technique:

  • window_size=500: Each chunk contains 500 characters
  • step=250: Move forward by 250 chars (50% overlap)
  • Benefits: Never loses context at chunk boundaries
  • Trade-off: Creates more chunks, increases storage/search time
  • Use case: Legal documents, contracts where every detail matters

📊

How to Manage Context Windows in RAG Systems

Step-by-Step Context Optimization Tutorial for Beginners

πŸ” Context Window Management:

  • β€’ Token counting: Critical for staying within LLM limits (GPT-4: 8k-32k, Claude: 100k)
  • β€’ Reserved tokens: Always reserve space for the model's response
  • β€’ Selection strategies: Choose which chunks to include based on different criteria
  • β€’ Graceful truncation: If needed, truncate intelligently at sentence boundaries

📦 Required Dependencies:

Before running this context optimization code, install the required packages:

pip install tiktoken

from typing import List, Dict
import tiktoken

class ContextManager:
    """Learn how to manage context windows for RAG systems step-by-step"""
    
    def __init__(self, model_name="gpt-3.5-turbo", max_tokens=4000):
        self.encoding = tiktoken.encoding_for_model(model_name)
        self.max_tokens = max_tokens
        self.reserved_tokens = 500  # Reserve for response
        
    def count_tokens(self, text: str) -> int:
        """Count tokens in text"""
        return len(self.encoding.encode(text))
    
    def optimize_context(self, 
                        documents: List[Dict], 
                        query: str,
                        strategy: str = "relevance") -> List[Dict]:
        """Optimize document selection for context window"""
        
        query_tokens = self.count_tokens(query)
        available_tokens = self.max_tokens - self.reserved_tokens - query_tokens
        
        if strategy == "relevance":
            return self._relevance_based_selection(documents, available_tokens)
        elif strategy == "diversity":
            return self._relevance_based_selection(documents, available_tokens)  # Simplified for demo
        elif strategy == "recency":
            return self._relevance_based_selection(documents, available_tokens)  # Simplified for demo
        else:
            return self._relevance_based_selection(documents, available_tokens)  # Simplified for demo
    
    def _relevance_based_selection(self, documents, available_tokens):
        """Select most relevant documents within token limit"""
        selected = []
        current_tokens = 0
        
        # Sort by relevance score
        sorted_docs = sorted(
            documents, 
            key=lambda x: x.get("score", 0), 
            reverse=True
        )
        
        for doc in sorted_docs:
            doc_tokens = self.count_tokens(doc["content"])
            if current_tokens + doc_tokens <= available_tokens:
                selected.append(doc)
                current_tokens += doc_tokens
            else:
                # Try to fit partial document
                remaining_tokens = available_tokens - current_tokens
                if remaining_tokens > 100:  # Minimum useful chunk
                    truncated_content = self._truncate_to_tokens(
                        doc["content"], 
                        remaining_tokens
                    )
                    doc_copy = doc.copy()
                    doc_copy["content"] = truncated_content
                    doc_copy["truncated"] = True
                    selected.append(doc_copy)
                break
        
        return selected
    
    def _truncate_to_tokens(self, text, max_tokens):
        """Truncate text to fit within token limit"""
        tokens = self.encoding.encode(text)
        if len(tokens) <= max_tokens:
            return text
        
        truncated_tokens = tokens[:max_tokens]
        return self.encoding.decode(truncated_tokens)
    
    def create_context_prompt(self, documents, query):
        """Create optimized prompt with context"""
        context_parts = []
        
        for i, doc in enumerate(documents):
            source = doc.get("metadata", {}).get("source", "Unknown")
            truncated = " (truncated)" if doc.get("truncated") else ""
            
            context_parts.append(
                f"[Document {i+1} - {source}{truncated}]\n"
                f"{doc['content']}\n"
            )
        
        context = "\n---\n".join(context_parts)
        
        return f"""Based on the following context, answer the question.
        
Context:
{context}

Question: {query}

Answer:"""

# Usage example - Context Optimization Demo
context_manager = ContextManager(max_tokens=4000)

# Simulate retrieved documents with realistic content
documents = [
    {
        "content": "RAG (Retrieval Augmented Generation) is a technique that combines information retrieval with text generation. It helps AI models access external knowledge to provide more accurate and up-to-date responses by retrieving relevant documents before generating answers.",
        "score": 0.95,
        "metadata": {"source": "rag_guide.pdf"}
    },
    {
        "content": "Vector databases like ChromaDB, Pinecone, and Weaviate are essential components of RAG systems. They store document embeddings and enable fast similarity search to find relevant context for user queries.",
        "score": 0.87,
        "metadata": {"source": "vector_db_tutorial.txt"}
    },
    {
        "content": "Traditional search relies on keyword matching, while RAG uses semantic similarity. This allows the system to understand meaning and context, making it more effective for question answering applications.",
        "score": 0.82,
        "metadata": {"source": "semantic_search.md"}
    },
]

print("=== Context Optimization Demo ===")
print(f"Maximum tokens: {context_manager.max_tokens}")
print(f"Reserved for response: {context_manager.reserved_tokens}")

query = "What is RAG and how does it work?"
query_tokens = context_manager.count_tokens(query)
print(f"Query tokens: {query_tokens}")

# Test different strategies
for strategy in ["relevance", "diversity"]:
    print(f"\n--- Strategy: {strategy} ---")
    
    # Optimize context
    optimized_docs = context_manager.optimize_context(
        documents, 
        query,
        strategy=strategy
    )
    
    print(f"Selected documents: {len(optimized_docs)}")
    
    total_tokens = query_tokens + context_manager.reserved_tokens
    for i, doc in enumerate(optimized_docs):
        doc_tokens = context_manager.count_tokens(doc["content"])
        total_tokens += doc_tokens
        truncated = " (TRUNCATED)" if doc.get("truncated") else ""
        print(f"  Doc {i+1}: {doc_tokens} tokens{truncated}")
    
    print(f"Total tokens used: {total_tokens}/{context_manager.max_tokens}")
    
    # Create and show sample prompt
    if strategy == "relevance":  # Show prompt for one strategy
        prompt = context_manager.create_context_prompt(optimized_docs, query)
        print(f"\nSample prompt preview (first 200 chars):")
        print(f"{prompt[:200]}...")

print("\n=== Token Counting Examples ===")
sample_texts = [
    "Hello world",
    "This is a longer sentence with more tokens to demonstrate counting.",
    "RAG systems are amazing!"
]

for text in sample_texts:
    token_count = context_manager.count_tokens(text)
    print(f"Text: '{text}' β†’ {token_count} tokens")

💡 Key Components Explained:

Token Counting (tiktoken):

Essential for accurate context management. Different models use different tokenizers - tiktoken handles OpenAI models.

Reserved Tokens:

Always reserve 500-1000 tokens for the model's response. Running out of space mid-response is bad UX.

Selection Algorithm:

Greedy approach: add highest-scoring docs until space runs out. Considers partial inclusion of last doc.

📊 Context Selection Strategies:

Relevance-based:

Prioritizes highest-scoring chunks. Best for focused questions.

Diversity-based:

Includes varied perspectives. Good for comprehensive answers.

Recency-based:

Prioritizes newer content. Ideal for time-sensitive queries.

Balanced:

Combines multiple factors. Most robust for general use.
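
The ContextManager above only implements the relevance path; as a rough sketch, a balanced selector might blend the similarity score with a recency signal before running the same greedy token fill (the weights and the updated_at metadata field are assumptions):

from datetime import datetime, timezone

def balanced_score(doc, relevance_weight=0.7, recency_weight=0.3):
    """Blend similarity with document age so newer documents get a modest boost."""
    relevance = doc.get("score", 0.0)
    updated_at = doc.get("metadata", {}).get("updated_at")  # hypothetical ISO-8601 timestamp
    recency = 0.0
    if updated_at:
        parsed = datetime.fromisoformat(updated_at)
        if parsed.tzinfo is None:
            parsed = parsed.replace(tzinfo=timezone.utc)
        age_days = (datetime.now(timezone.utc) - parsed).days
        recency = max(0.0, 1.0 - age_days / 365)  # decay linearly over a year
    return relevance_weight * relevance + recency_weight * recency

# Re-rank, then feed the result into the same greedy fill used by _relevance_based_selection.
ranked = sorted(documents, key=balanced_score, reverse=True)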

⚠️ Token Limit Considerations:

  • β€’ GPT-3.5: 4k tokens (budget) or 16k tokens (extended)
  • β€’ GPT-4: 8k, 32k, or 128k tokens depending on version
  • β€’ Claude: 100k tokens (huge context window)
  • β€’ Rule of thumb: 1 token β‰ˆ 0.75 words in English
⚡

RAG Performance Optimization

πŸ” Retrieval Optimization

Hybrid Search: Combines semantic (vector) + keyword (BM25) search for better coverage

# Hybrid search combining multiple strategies
from langchain.retrievers import (
    ContextualCompressionRetriever,
    EnsembleRetriever
)
from langchain.retrievers.document_compressors import (
    LLMChainExtractor
)
from langchain_community.retrievers import BM25Retriever  # requires: pip install rank_bm25

# Create ensemble retriever
semantic_retriever = vector_store.as_retriever(
    search_kwargs={"k": 10}
)

keyword_retriever = BM25Retriever.from_documents(
    documents
)
keyword_retriever.k = 10

ensemble = EnsembleRetriever(
    retrievers=[semantic_retriever, keyword_retriever],
    weights=[0.6, 0.4]
)

# Add compression
compressor = LLMChainExtractor.from_llm(llm)
compressed_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=ensemble
)

💡 Benefits: Semantic search finds conceptually similar content, while BM25 catches exact keyword matches. Compression removes irrelevant parts.

💾 Caching Strategy

Query Caching: Stores results for repeated questions to reduce latency and API costs

import hashlib

class CachedRAG:
    def __init__(self, rag_system):
        self.rag = rag_system
        self.cache = {}
        
    def _get_query_hash(self, query):
        """Create a stable hash of the query to use as a cache key"""
        return hashlib.md5(
            query.encode()
        ).hexdigest()
    
    def query_with_cache(self, query):
        """Query with caching"""
        query_hash = self._get_query_hash(query)
        
        if query_hash in self.cache:
            print("Cache hit!")
            return self.cache[query_hash]
        
        # Execute query
        result = self.rag.query(query)
        
        # Cache result
        self.cache[query_hash] = result
        
        return result

💡 Benefits: Instant responses for common questions; in production workloads with many repeated queries this can cut embedding/LLM API costs substantially (often cited at 50-80%).

🚀 Advanced Performance Techniques

Parallel Processing:
# Async retrieval for multiple queries
import asyncio

async def parallel_retrieve(queries):
    # rag.query is blocking, so run each call in a worker thread and
    # gather the results concurrently (asyncio.to_thread needs Python 3.9+)
    tasks = [asyncio.to_thread(rag.query, q) for q in queries]
    return await asyncio.gather(*tasks)

# results = asyncio.run(parallel_retrieve(["What is RAG?", "How does chunking work?"]))
Batch Embeddings:
# Process documents in batches
def batch_embed_documents(docs, batch_size=100):
    embeddings = []
    for i in range(0, len(docs), batch_size):
        batch = docs[i:i + batch_size]
        batch_embeddings = embedding_model.embed_documents(
            [d.page_content for d in batch]
        )
        embeddings.extend(batch_embeddings)
    return embeddings

✨ RAG Best Practices

Document Processing

  • β€’ Clean and preprocess documents before chunking
  • β€’ Preserve document structure and metadata
  • β€’ Test different chunk sizes for your use case
  • β€’ Consider domain-specific chunking strategies

Retrieval Quality

  • β€’ Use hybrid search for better coverage
  • β€’ Implement relevance feedback loops
  • β€’ Monitor and optimize retrieval metrics
  • β€’ Consider query expansion techniques

🚧 Common RAG Challenges

Lost Context Problem

Issue: Important information split across chunks
Solution: Increase chunk overlap and use sliding windows

Retrieval Precision

Issue: Retrieved chunks not always most relevant
Solution: Implement reranking and query expansion
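
A common remedy is to over-retrieve and rerank with a cross-encoder before building the prompt; a sketch using the sentence-transformers library (the model name and k values are typical choices, not requirements):

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(question, docs, top_n=5):
    """Score each (question, chunk) pair jointly and keep only the best top_n."""
    pairs = [(question, d.page_content) for d in docs]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]

# candidates = retriever.invoke(question)   # retrieve generously, e.g. k=20
# best_docs = rerank(question, candidates, top_n=5)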

Hallucination Risk

Issue: LLM generates information not in context
Solution: Use strict prompts and fact-checking mechanisms

🎉 Next Steps

Congratulations on building your first complete RAG system! You've mastered the fundamentals of document processing, retrieval, and generation. Next, you'll explore advanced RAG techniques including query expansion, reranking, and multi-modal RAG.