Phase 2 · Intermediate · ⏱ 90 minutes

Document Processing & Text Splitting

Master document processing for retrieval augmented generation (RAG) applications. Learn how to load and split documents with LangChain text splitters, apply document chunking best practices, and choose chunk sizes that maximize retrieval performance.

🎯

Learning Objectives

  • Master LangChain document loaders for various file formats in RAG applications
  • Apply text splitting strategies and choose optimal chunk sizes for retrieval augmented generation
  • Implement metadata extraction and management systems
  • Build PDF-to-vector pipelines for retrieval augmented generation systems
📄

Why Document Processing Matters

Before building RAG applications, you need to master document chunking with LangChain text splitters. Raw documents must be loaded and converted into well-sized chunks before they can support effective retrieval augmented generation.

❌ Poor Processing:

  • Documents too large for context windows
  • Lost formatting and structure
  • Missing important metadata
  • Inconsistent chunk boundaries

✅ Smart Processing:

  • Optimal chunk sizes for context
  • Preserved document structure
  • Rich metadata extraction
  • Semantic boundary awareness

🎯 Document Processing for RAG Applications:

  • Text Splitting Strategies: Apply optimal chunk size for retrieval augmented generation
  • Document Chunking Best Practices: Balance context preservation and retrieval precision
  • LangChain Text Splitter: Use RecursiveCharacterTextSplitter for intelligent splitting
  • PDF to Vector Pipeline: Convert documents for efficient RAG retrieval
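
The rest of this phase walks through each of these pieces in depth. As a preview, here is a minimal, hedged sketch of the full flow; the example.txt path is a placeholder for any local text file:

from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# 1. Load raw documents (any loader from the next section works here)
docs = TextLoader("example.txt", encoding="utf-8").load()

# 2. Split into retrieval-sized chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)

print(f"{len(docs)} document(s) -> {len(chunks)} chunks")
# 3. (Later phase) embed `chunks` and store them in a vector database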
📁

LangChain Document Loader for RAG Applications

Loading Different Document Types

Master how to load documents for RAG with LangChain document loaders. Each loader converts a source format into Document objects ready for chunking in a retrieval augmented generation application:

🔍 LangChain Document Processing Tutorial Components:

  • TextLoader: Process text files for RAG with optimal chunk size
  • PyPDFLoader: PDF to vector conversion with page-aware splitting
  • WebBaseLoader: Web content processing for retrieval augmented generation
  • DirectoryLoader: Batch document processing with text splitting strategies
from langchain_community.document_loaders import (
    TextLoader, 
    PyPDFLoader, 
    WebBaseLoader,
    DirectoryLoader
)
import os

# 1. Loading Text Files
def load_text_documents():
    """Load plain text files with proper encoding"""
    text_loader = TextLoader("sample_document.txt", encoding="utf-8")
    text_docs = text_loader.load()
    
    print(f"Loaded {len(text_docs)} text documents")
    print(f"First doc preview: {text_docs[0].page_content[:200]}...")
    print(f"Metadata: {text_docs[0].metadata}")
    
    return text_docs

# 2. Loading PDF Files
def load_pdf_documents():
    """Load PDF files page by page"""
    pdf_loader = PyPDFLoader("research_paper.pdf")
    pdf_docs = pdf_loader.load()
    
    print(f"Loaded {len(pdf_docs)} PDF pages")
    for i, doc in enumerate(pdf_docs[:3]):  # Show first 3 pages
        print(f"Page {i+1} preview: {doc.page_content[:150]}...")
        print(f"Page {i+1} metadata: {doc.metadata}")
    
    return pdf_docs

# 3. Loading Web Pages
def load_web_documents():
    """Load and clean web content"""
    urls = [
        "https://python.langchain.com/docs/get_started/introduction",
        "https://python.langchain.com/docs/modules/data_connection"
    ]
    
    web_loader = WebBaseLoader(urls)
    web_docs = web_loader.load()
    
    print(f"Loaded {len(web_docs)} web pages")
    for doc in web_docs:
        print(f"URL: {doc.metadata.get('source', 'Unknown')}")
        print(f"Content preview: {doc.page_content[:200]}...")
    
    return web_docs

# 4. Batch Loading with DirectoryLoader
def load_directory_documents():
    """Load all documents from a directory"""
    # Load all .txt and .pdf files from a directory
    loader = DirectoryLoader(
        "documents/",
        glob="**/*.{txt,pdf}",
        loader_cls=TextLoader,  # Default loader for text files
        loader_kwargs={'autodetect_encoding': True}
    )
    
    docs = loader.load()
    print(f"Loaded {len(docs)} documents from directory")
    
    # Group by file type
    file_types = {}
    for doc in docs:
        ext = doc.metadata['source'].split('.')[-1]
        file_types[ext] = file_types.get(ext, 0) + 1
    
    print(f"File types: {file_types}")
    return docs

# 5. Custom Loader with Error Handling
def load_documents_safely(file_paths):
    """Load documents with comprehensive error handling"""
    loaded_docs = []
    failed_files = []
    
    for file_path in file_paths:
        try:
            if file_path.endswith('.pdf'):
                loader = PyPDFLoader(file_path)
            elif file_path.endswith('.txt'):
                loader = TextLoader(file_path, encoding='utf-8')
            else:
                print(f"Unsupported file type: {file_path}")
                continue
                
            docs = loader.load()
            loaded_docs.extend(docs)
            print(f"✅ Successfully loaded: {file_path}")
            
        except Exception as e:
            print(f"❌ Failed to load {file_path}: {str(e)}")
            failed_files.append((file_path, str(e)))
    
    print(f"\nSummary:")
    print(f"Successfully loaded: {len(loaded_docs)} documents")
    print(f"Failed files: {len(failed_files)}")
    
    if failed_files:
        print("\nFailed files:")
        for file_path, error in failed_files:
            print(f"  {file_path}: {error}")
    
    return loaded_docs, failed_files

# Example usage
if __name__ == "__main__":
    # Create sample files for testing
    with open("sample_document.txt", "w", encoding="utf-8") as f:
        f.write("This is a sample document for testing LangChain loaders.\n" * 10)
    
    # Test different loaders
    text_docs = load_text_documents()
    
    # For PDF and web loading, you'll need actual files/URLs
    # pdf_docs = load_pdf_documents()
    # web_docs = load_web_documents()
    
    # Clean up
    if os.path.exists("sample_document.txt"):
        os.remove("sample_document.txt")

📖 Code Explanation:

  • TextLoader: Loads plain text files; pass an explicit encoding or autodetect_encoding=True to handle various character sets
  • PyPDFLoader: Extracts text from PDFs page by page, preserving page boundaries in metadata
  • WebBaseLoader: Fetches and cleans HTML content from web pages, removing tags and scripts
  • DirectoryLoader: Batch processes multiple files with glob patterns for efficient loading
  • Error Handling: Comprehensive try-catch blocks to handle missing files or format issues

💡 Expected Output:

Loaded 1 text documents
First doc preview: This is a sample document for testing LangChain loaders.
This is a sample document for testing LangChain loaders...
Metadata: {'source': 'sample_document.txt'}

⚠️ Common Issues and Solutions:

  • Encoding errors: Use encoding='utf-8' or autodetect_encoding=True
  • Large PDFs: Use PyPDFLoader.load_and_split() for automatic chunking
  • Web timeouts: Add requests_kwargs={"timeout": 30} to WebBaseLoader
  • Memory issues: Process files in batches rather than loading all at once
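
Putting a few of these fixes together, a short sketch (the URL, timeout value, and batch size are illustrative assumptions):

from langchain_community.document_loaders import PyPDFLoader, WebBaseLoader

# Web timeouts: pass requests_kwargs through to the underlying HTTP session
web_loader = WebBaseLoader(
    ["https://python.langchain.com/docs/get_started/introduction"],
    requests_kwargs={"timeout": 30},
)

# Large PDFs: load_and_split() loads pages and chunks them in one call
# pdf_chunks = PyPDFLoader("research_paper.pdf").load_and_split()

# Memory issues: hand file paths to the loaders in small batches
def batched(paths, batch_size=10):
    for i in range(0, len(paths), batch_size):
        yield paths[i:i + batch_size]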
✂️

Text Splitting Strategies for Retrieval Augmented Generation

LangChain Text Splitter: Document Chunking Best Practices

Learn how to split documents for RAG using RecursiveCharacterTextSplitter with optimal chunk size strategies:

🔍 Text Splitting Strategies for RAG Applications:

  • Optimal Chunk Size: Balance context preservation with retrieval precision for RAG
  • Chunk Overlap: Maintain continuity for better retrieval augmented generation
  • Semantic Boundaries: Split at natural breaks for document coherence
  • RAG Optimization: Configure for your specific retrieval needs
from langchain.text_splitter import RecursiveCharacterTextSplitter

# 1. Basic Recursive Character Splitting
def demonstrate_basic_splitting():
    """Show basic text splitting with different configurations"""
    
    # Sample long text
    sample_text = """
    Artificial Intelligence (AI) is a broad field that encompasses machine learning, 
    natural language processing, computer vision, and robotics.
    
    Machine Learning is a subset of AI that focuses on algorithms that can learn 
    from and make predictions or decisions based on data. Deep learning, a subset 
    of machine learning, uses neural networks with multiple layers.
    
    Natural Language Processing (NLP) deals with the interaction between computers 
    and human language. It includes tasks like sentiment analysis, machine translation, 
    and text summarization.
    
    Computer Vision enables machines to interpret and understand visual information 
    from the world. Applications include image recognition, object detection, 
    and facial recognition.
    
    Robotics combines AI with mechanical engineering to create intelligent machines 
    that can perform tasks autonomously or with human guidance.
    """ * 5  # Repeat to make it longer
    
    # Basic splitter configuration
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=200,        # Maximum characters per chunk
        chunk_overlap=50,      # Characters to overlap between chunks
        length_function=len,   # Function to measure length
        is_separator_regex=False,
    )
    
    chunks = text_splitter.split_text(sample_text)
    
    print(f"Original text length: {len(sample_text)} characters")
    print(f"Number of chunks: {len(chunks)}")
    print()
    
    # Display chunks with overlap visualization
    for i, chunk in enumerate(chunks[:3]):  # Show first 3 chunks
        print(f"--- Chunk {i+1} ({len(chunk)} chars) ---")
        print(chunk[:150] + "..." if len(chunk) > 150 else chunk)
        print()
    
    return chunks

# 2. Token-Aware Splitting
def demonstrate_token_splitting():
    """Split text based on token count instead of characters"""
    import tiktoken
    
    # Use tiktoken for accurate token counting
    encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
    
    def tiktoken_len(text):
        tokens = encoding.encode(text)
        return len(tokens)
    
    # Token-based splitter
    token_splitter = RecursiveCharacterTextSplitter(
        chunk_size=100,           # Maximum tokens per chunk
        chunk_overlap=20,         # Token overlap
        length_function=tiktoken_len,  # Use token counting
        separators=["\n\n", "\n", ". ", " ", ""]
    )
    
    sample_text = """
    Large Language Models (LLMs) are AI systems trained on vast amounts of text data. 
    They can generate human-like text, answer questions, translate languages, and 
    perform many other language-related tasks. Popular LLMs include GPT-4, Claude, 
    and PaLM. These models have billions of parameters and require significant 
    computational resources for training and inference.
    """ * 3
    
    chunks = token_splitter.split_text(sample_text)
    
    print("Token-based splitting:")
    print(f"Original text tokens: {tiktoken_len(sample_text)}")
    print(f"Number of chunks: {len(chunks)}")
    
    for i, chunk in enumerate(chunks):
        token_count = tiktoken_len(chunk)
        print(f"Chunk {i+1}: {token_count} tokens")
        print(f"Content: {chunk[:100]}...")
        print()
    
    return chunks

# 3. Document-Aware Splitting
def demonstrate_document_splitting():
    """Split documents while preserving metadata"""
    from langchain.schema import Document
    
    # Create sample documents with metadata
    documents = [
        Document(
            page_content="Python is a high-level programming language. " * 20,
            metadata={"source": "python_guide.txt", "chapter": "Introduction"}
        ),
        Document(
            page_content="Machine learning algorithms can be supervised or unsupervised. " * 25,
            metadata={"source": "ml_textbook.pdf", "chapter": "Algorithms", "page": 42}
        )
    ]
    
    # Document splitter preserves metadata
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=150,
        chunk_overlap=30
    )
    
    split_docs = text_splitter.split_documents(documents)
    
    print("Document splitting with metadata preservation:")
    print(f"Original documents: {len(documents)}")
    print(f"Split documents: {len(split_docs)}")
    print()
    
    for i, doc in enumerate(split_docs):
        print(f"--- Split Document {i+1} ---")
        print(f"Content: {doc.page_content[:100]}...")
        print(f"Metadata: {doc.metadata}")
        print()
    
    return split_docs

# 4. Custom Separator Splitting
def demonstrate_custom_separators():
    """Use custom separators for specific document types"""
    
    # Code document with specific structure
    code_text = """
def process_data(data):
    '''Process input data'''
    cleaned_data = clean_data(data)
    return cleaned_data

class DataProcessor:
    def __init__(self, config):
        self.config = config
    
    def process(self, data):
        return self.transform(data)

# Configuration settings
CONFIG = {
    'batch_size': 32,
    'learning_rate': 0.001
}
    """
    
    # Code-specific separators
    code_splitter = RecursiveCharacterTextSplitter(
        chunk_size=200,
        chunk_overlap=20,
        separators=[
            "\n\nclass ",      # Class definitions
            "\n\ndef ",       # Function definitions  
            "\n\n# ",         # Comments
            "\n\n",           # Double newlines
            "\n",              # Single newlines
            " ",                # Spaces
            ""                  # Characters
        ]
    )
    
    chunks = code_splitter.split_text(code_text)
    
    print("Code-aware splitting:")
    for i, chunk in enumerate(chunks):
        print(f"--- Code Chunk {i+1} ---")
        print(chunk)
        print()
    
    return chunks

# 5. Optimal Chunk Size Testing
def find_optimal_chunk_size(text):
    """Test different chunk sizes to find optimal configuration"""
    chunk_sizes = [100, 200, 500, 1000, 2000]
    results = []
    
    for size in chunk_sizes:
        splitter = RecursiveCharacterTextSplitter(
            chunk_size=size,
            chunk_overlap=size // 10  # 10% overlap
        )
        
        chunks = splitter.split_text(text)
        avg_length = sum(len(chunk) for chunk in chunks) / len(chunks)
        
        results.append({
            'chunk_size': size,
            'num_chunks': len(chunks),
            'avg_length': avg_length,
            'shortest': min(len(chunk) for chunk in chunks),
            'longest': max(len(chunk) for chunk in chunks)
        })
    
    print("Chunk size optimization results:")
    print(f"{'Size':<6} {'Chunks':<7} {'Avg Len':<8} {'Min':<6} {'Max':<6}")
    print("-" * 40)
    
    for result in results:
        print(f"{result['chunk_size']:<6} "
              f"{result['num_chunks']:<7} "
              f"{result['avg_length']:<8.0f} "
              f"{result['shortest']:<6} "
              f"{result['longest']:<6}")
    
    return results

# Example usage
if __name__ == "__main__":
    print("=== Text Splitting Demonstrations ===\n")
    
    # Run demonstrations
    basic_chunks = demonstrate_basic_splitting()
    print("\n" + "="*50 + "\n")
    
    # Uncomment if tiktoken is installed
    # token_chunks = demonstrate_token_splitting()
    # print("\n" + "="*50 + "\n")
    
    doc_chunks = demonstrate_document_splitting()
    print("\n" + "="*50 + "\n")
    
    code_chunks = demonstrate_custom_separators()
    print("\n" + "="*50 + "\n")
    
    # Test with sample text
    sample_text = "Lorem ipsum dolor sit amet. " * 100
    find_optimal_chunk_size(sample_text)

📖 Code Explanation:

  • RecursiveCharacterTextSplitter: Intelligently splits text at natural boundaries (paragraphs, sentences, words)
  • chunk_size: Maximum characters per chunk - balance between context and LLM limits
  • chunk_overlap: Overlapping text ensures context continuity between chunks
  • Token-based splitting: Uses tiktoken to count actual LLM tokens instead of characters
  • Custom separators: Define hierarchy for code, markdown, or domain-specific formats
  • Metadata preservation: Each chunk inherits parent document metadata

🎯 Choosing Chunk Sizes:

Small chunks (100-500 chars):

✅ More precise retrieval | ✅ Better for Q&A | ❌ May lose context | ❌ More chunks to manage

Medium chunks (500-1500 chars):

✅ Good balance | ✅ Preserves paragraph context | ✅ Works with most LLMs | 🔧 Most common choice

Large chunks (1500-4000 chars):

✅ Full context preserved | ✅ Good for summarization | ❌ Less precise retrieval | ❌ Higher token costs
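
One way to encode this guidance is a set of named splitter presets; the numbers below are drawn from the ranges above and are starting points, not hard rules:

from langchain.text_splitter import RecursiveCharacterTextSplitter

SPLITTER_PRESETS = {
    "precise_qa":    {"chunk_size": 300,  "chunk_overlap": 50},   # small chunks
    "balanced":      {"chunk_size": 1000, "chunk_overlap": 150},  # medium chunks
    "summarization": {"chunk_size": 3000, "chunk_overlap": 300},  # large chunks
}

def make_splitter(preset: str) -> RecursiveCharacterTextSplitter:
    """Build a RecursiveCharacterTextSplitter from a named preset."""
    return RecursiveCharacterTextSplitter(**SPLITTER_PRESETS[preset])

chunks = make_splitter("balanced").split_text("Lorem ipsum dolor sit amet. " * 200)
print(f"balanced preset -> {len(chunks)} chunks")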

💡 Splitting Results:

Original text length: 2460 characters
Number of chunks: 15

--- Chunk 1 (195 chars) ---
Artificial Intelligence (AI) is a broad field that encompasses machine learning, 
natural language processing, computer vision, and robotics...

--- Chunk 2 (198 chars) ---
Machine Learning is a subset of AI that focuses on algorithms that can learn 
from and make predictions or decisions based on data...

⚠️ Important Considerations:

  • Context windows: GPT-4 (8k-32k), Claude (100k), Gemini (1M) - adjust chunk sizes accordingly
  • Overlap strategy: 10-20% overlap is typical, more for technical content
  • Special formats: Code needs function boundaries, tables need row integrity
  • Performance: Smaller chunks = more embeddings to compute and store
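
A small sketch applying two of these considerations, 10-20% overlap and a token budget check; the 15% overlap and the 8,000-token budget are illustrative assumptions:

import tiktoken
from langchain.text_splitter import RecursiveCharacterTextSplitter

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

chunk_size = 1000
splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=int(chunk_size * 0.15),  # 10-20% overlap is typical
)

chunks = splitter.split_text("Retrieval augmented generation pipelines need well-sized chunks. " * 200)

# Sanity check: every chunk should fit comfortably inside the model's context window
budget = 8_000
assert all(len(enc.encode(c)) < budget for c in chunks)
print(f"{len(chunks)} chunks, all under {budget} tokens")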
🏷️

Metadata Management for RAG Applications

Extracting and Enriching Document Metadata

Enhance retrieval augmented generation with rich metadata. Good chunking practice includes tracking source information, chunk indices, and custom attributes so the retriever can filter results and attribute answers:

🔍 Metadata for RAG Applications:

  • RAG Filtering: Enable precise retrieval in retrieval augmented generation
  • Chunk Tracking: Monitor optimal chunk size and position in documents
  • Source Attribution: Track origins for RAG response generation
  • Performance Optimization: Use metadata for efficient RAG queries
from langchain_community.document_loaders import TextLoader
from langchain.schema import Document
from datetime import datetime
import os

# Simple metadata extraction example
def add_file_metadata(documents, file_path):
    """Add basic file metadata to documents"""
    # Get file info
    file_stats = os.stat(file_path)
    filename = os.path.basename(file_path)
    
    # Add metadata to each document
    for doc in documents:
        doc.metadata.update({
            'source': file_path,
            'filename': filename,
            'file_size': file_stats.st_size,
            'created_at': datetime.fromtimestamp(file_stats.st_ctime).isoformat(),
            'word_count': len(doc.page_content.split())
        })
    
    return documents

# Example 1: Adding basic metadata during loading
# First, create a sample file
with open("product_docs.txt", "w") as f:
    f.write("""Product Documentation - System Configuration Guide
    
This guide explains how to configure our system for optimal performance.
Configuration settings include memory allocation, thread pools, and cache sizes.
Please refer to the troubleshooting section for common issues.""")

loader = TextLoader("product_docs.txt")
documents = loader.load()

# Enrich with file metadata
documents = add_file_metadata(documents, "product_docs.txt")

# Add custom metadata
for doc in documents:
    doc.metadata['department'] = 'engineering'
    doc.metadata['doc_type'] = 'technical'
    doc.metadata['version'] = '1.0'

print(f"Document metadata: {documents[0].metadata}")

# Example 2: Using metadata for filtering
def filter_documents_by_metadata(documents, filters):
    """Filter documents based on metadata criteria"""
    filtered_docs = []
    
    for doc in documents:
        match = True
        for key, value in filters.items():
            if key not in doc.metadata or doc.metadata[key] != value:
                match = False
                break
        
        if match:
            filtered_docs.append(doc)
    
    return filtered_docs

# Filter documents by department
engineering_docs = filter_documents_by_metadata(
    documents, 
    {'department': 'engineering'}
)

# Filter by multiple criteria
recent_tech_docs = filter_documents_by_metadata(
    documents,
    {'department': 'engineering', 'version': '1.0'}
)

print(f"Found {len(engineering_docs)} engineering documents")

# Example 3: Working with metadata in practice
# Create a simple document store with metadata-based retrieval
class SimpleDocumentStore:
    """Simple in-memory document store with metadata filtering"""
    def __init__(self):
        self.documents = []
    
    def add_documents(self, documents):
        """Add documents to the store"""
        self.documents.extend(documents)
    
    def search(self, query, metadata_filter=None, max_results=3):
        """Search documents with optional metadata filtering"""
        # Filter by metadata if provided
        filtered_docs = self.documents
        if metadata_filter:
            filtered_docs = filter_documents_by_metadata(
                self.documents, 
                metadata_filter
            )
        
        # Simple keyword search (in production, use embeddings)
        results = []
        query_lower = query.lower()
        
        for doc in filtered_docs:
            if query_lower in doc.page_content.lower():
                results.append(doc)
                if len(results) >= max_results:
                    break
        
        return results

# Create and use the document store
doc_store = SimpleDocumentStore()
doc_store.add_documents(documents)

# Search with metadata filtering
results = doc_store.search(
    "configure",
    metadata_filter={"department": "engineering"}
)

print(f"Found {len(results)} relevant documents matching 'configure'")

# Cleanup - remove the sample file (os was imported at the top)
if os.path.exists("product_docs.txt"):
    os.remove("product_docs.txt")

📖 Code Explanation:

  • Automatic metadata: File stats, creation time, and word count extracted automatically
  • Custom metadata: Business-specific fields like department, version, and document type
  • Metadata filtering: Enable precise document retrieval based on any metadata field
  • SimpleDocumentStore: Example of metadata-aware document storage and retrieval
  • Metadata inheritance: When documents are split, chunks inherit parent metadata

🎯 Metadata Best Practices:

Essential Metadata:
  • source (file path or URL)
  • created_at / modified_at
  • document_type
  • chunk_index / total_chunks
Business Metadata:
  • department / team
  • project / client
  • security_level
  • version / status

💡 Expected Output:

Document metadata: {
  'source': 'product_docs.txt',
  'filename': 'product_docs.txt', 
  'file_size': 272,
  'created_at': '2025-01-27T21:14:38.916810',
  'word_count': 36,
  'department': 'engineering',
  'doc_type': 'technical',
  'version': '1.0'
}
Found 1 engineering documents
Found 1 relevant documents matching 'configure'

⚠️ Metadata Tips:

  • Consistency: Use standard field names across all documents
  • Validation: Ensure required metadata fields are always present
  • Size limits: Keep metadata compact - it's stored with every chunk
  • Security: Don't store sensitive data in metadata if using external services
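
A minimal sketch of the validation tip above; the REQUIRED_FIELDS set simply reuses the "essential metadata" list and is an assumption, not a LangChain requirement:

from langchain.schema import Document

REQUIRED_FIELDS = {"source", "created_at", "document_type", "chunk_index"}

def missing_metadata(doc: Document) -> list:
    """Return the names of required metadata fields the document is missing."""
    return sorted(REQUIRED_FIELDS - doc.metadata.keys())

doc = Document(page_content="...", metadata={"source": "product_docs.txt"})
missing = missing_metadata(doc)
if missing:
    print(f"Missing metadata fields: {missing}")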
🔄

Document Ingestion Pipeline

LangChain Document Processing Tutorial: Building RAG Pipelines

Learn how to split documents for RAG by building a complete document-to-vector pipeline. This tutorial combines a LangChain text splitter with document chunking best practices; the example below processes text files, and the same flow extends to PDFs by swapping in PyPDFLoader:

🔍 RAG Pipeline Components:

  • LangChain Document Loader: Process PDFs and text for retrieval augmented generation
  • Optimal Chunk Size: Apply text splitting strategies for RAG performance
  • Metadata Enrichment: Add tracking for better RAG retrieval
  • Vector Preparation: Ready documents for embedding and retrieval
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document
import os
from datetime import datetime

# Simple document processing pipeline
def process_documents(directory_path, chunk_size=500, chunk_overlap=50):
    """Process all text files in a directory"""
    processed_docs = []
    
    # Find all text files
    for filename in os.listdir(directory_path):
        if filename.endswith('.txt'):
            file_path = os.path.join(directory_path, filename)
            
            # Load document
            loader = TextLoader(file_path)
            documents = loader.load()
            
            # Split into chunks
            text_splitter = RecursiveCharacterTextSplitter(
                chunk_size=chunk_size,
                chunk_overlap=chunk_overlap
            )
            
            # Process each document
            for doc in documents:
                # Add metadata
                doc.metadata.update({
                    'processed_at': datetime.now().isoformat(),
                    'chunk_size': chunk_size,
                    'source_file': filename
                })
                
                # Split text
                chunks = text_splitter.split_documents([doc])
                
                # Add chunk metadata
                for i, chunk in enumerate(chunks):
                    chunk.metadata['chunk_index'] = i
                    chunk.metadata['total_chunks'] = len(chunks)
                    processed_docs.append(chunk)
    
    return processed_docs

# Example usage
if __name__ == "__main__":
    # Create sample directory and files
    os.makedirs("docs", exist_ok=True)
    
    # Create sample files
    with open("docs/intro.txt", "w") as f:
        f.write("Welcome to our product. " * 50)
    
    with open("docs/features.txt", "w") as f:
        f.write("Key features include: " * 30 + "AI-powered search, automatic categorization.")
    
    # Process documents
    processed = process_documents("docs", chunk_size=100, chunk_overlap=20)
    
    print(f"Processed {len(processed)} chunks from documents")
    
    # Show first chunk
    if processed:
        print(f"\nFirst chunk:")
        print(f"Content: {processed[0].page_content[:100]}...")
        print(f"Metadata: {processed[0].metadata}")
    
    # Cleanup
    import shutil
    if os.path.exists("docs"):
        shutil.rmtree("docs")

📖 Code Explanation:

  • Pipeline approach: Combines loading, splitting, and metadata in one flow
  • Batch processing: Handles multiple files efficiently with consistent settings
  • Chunk tracking: Each chunk knows its position (chunk_index) and total count
  • Processing metadata: Timestamps and settings recorded for debugging
  • Error resilience: wrap the per-file work in try/except so one bad file doesn't stop the batch (the pipeline above keeps the happy path for brevity; see the sketch below)
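
A hedged sketch of that per-file error handling, reusing the loader and splitter from the pipeline above:

from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

def process_file_safely(file_path, text_splitter):
    """Load and split one file; return (chunks, error_message_or_None)."""
    try:
        docs = TextLoader(file_path, encoding="utf-8").load()
        return text_splitter.split_documents(docs), None
    except Exception as e:  # unreadable or malformed file: record it, keep going
        return [], f"{file_path}: {e}"

splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks, error = process_file_safely("docs/intro.txt", splitter)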

🎯 Pipeline Architecture:

Directory → Loader Selection → Document Loading → Text Splitting → Metadata Enrichment → Output
    │             │                   │                 │                    │              │
    └─ Scan ──────┴─ By extension ────┴─ Parse ─────────┴─ Chunk ────────────┴─ Enrich ─────┘

💡 Expected Output:

Processed 24 chunks from documents

First chunk:
Content: Welcome to our product. Welcome to our product. Welcome to our...
Metadata: {
  'source': 'docs/intro.txt',
  'processed_at': '2025-01-27T22:14:38.916810',
  'chunk_size': 100,
  'source_file': 'intro.txt',
  'chunk_index': 0,
  'total_chunks': 15
}

⚠️ Production Considerations:

  • Async processing: Use async/await for I/O operations at scale
  • Progress tracking: Implement callbacks or progress bars for large batches
  • Deduplication: Check if documents are already processed before re-processing
  • Error recovery: Save state to resume failed batch operations
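
A short sketch of the deduplication idea: hash each chunk's text and skip anything already seen. The in-memory set is an assumption; a production pipeline would persist these hashes alongside the vector store:

import hashlib

def deduplicate_chunks(chunks):
    """Drop chunks whose page_content has already been processed."""
    seen, unique = set(), []
    for chunk in chunks:
        digest = hashlib.sha256(chunk.page_content.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)
    return unique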

✨ Document Processing Best Practices

Chunking Strategy

  • Use semantic boundaries (paragraphs, sections)
  • Maintain 10-20% overlap between chunks
  • Optimize chunk size for your LLM's context window
  • Consider content type when setting parameters

Metadata Management

  • Extract source and structural information
  • Include processing timestamps for debugging
  • Add custom metadata for business logic
  • Use consistent naming conventions

💡Document Processing Patterns

Academic Papers

Section-based chunking + Citation extraction + Abstract/conclusion emphasis

Business Documents

Hierarchical structure + Metadata tagging + Department/project categorization

Code Documentation

Function/class boundaries + API endpoint grouping + Language-specific parsing
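
For the code-documentation pattern, LangChain ships a language-aware splitter that prefers class and function boundaries. A hedged sketch (the sample source string is illustrative):

from langchain.text_splitter import Language, RecursiveCharacterTextSplitter

python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,  # prefer splitting at class/def boundaries
    chunk_size=200,
    chunk_overlap=20,
)

sample_source = "def load():\n    return 1\n\nclass Store:\n    def add(self):\n        pass\n"
print(python_splitter.split_text(sample_source))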

🎉 Congratulations!

You've mastered document chunking best practices and learned how to split documents for RAG with LangChain text splitters! Next, you'll explore Vector Embeddings & Similarity Search to complete your retrieval augmented generation pipeline by converting your optimally chunked documents into searchable vectors.