Vector Embeddings & Similarity Search
Master how to create embeddings for retrieval augmented generation (RAG) using LangChain. Build vector search systems with Chroma, Pinecone, and other vector databases to power intelligent RAG applications.
Learning Objectives
- Learn how to create embeddings for RAG applications with LangChain
- Master LangChain embeddings with OpenAI, Google, and other models for RAG
- Build RAG vector search with cosine similarity and vector databases
- Choose the right LangChain vector store: Chroma, Pinecone, Weaviate, or FAISS
📋 Prerequisites: Ready for Advanced Topics?
This advanced tutorial assumes you have:
- Completed Phase 1 & 2 (AI Fundamentals & LangChain Essentials)
- Strong Python programming and NumPy experience
- Basic understanding of linear algebra and vectors
- Familiarity with machine learning concepts
How to Create Embeddings for Retrieval Augmented Generation
In retrieval augmented generation (RAG), vector embeddings are the foundation of semantic search. Learn how to create embeddings using LangChain to transform your documents into searchable vectors that power RAG applications with Chroma, Pinecone, and other vector databases.
🔍 RAG Embedding Pipeline:
The RAG vector search pipeline: Documents → LangChain Embeddings → Vector Store (Chroma/Pinecone) → Similarity Search → Retrieved Context. This tutorial shows you how to build each component for production RAG applications.
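To make the shape of that pipeline concrete, here is a minimal in-memory sketch. It is not LangChain, Chroma, or Pinecone code: `toy_embed` and `InMemoryVectorStore` are hypothetical names, and the hashed bag-of-words "embedding" only matches shared words, not meaning. It stands in for a real embedding model so the example runs without an API key.

```python
import math
from typing import List, Tuple


def toy_embed(text: str, dims: int = 16) -> List[float]:
    """Hypothetical stand-in for an embedding model: hashed bag of words."""
    vec = [0.0] * dims
    for word in text.lower().split():
        # Character-code sums are deterministic across runs, unlike hash().
        vec[sum(ord(c) for c in word) % dims] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]  # unit length, so dot product = cosine


class InMemoryVectorStore:
    """Toy vector store: keeps (document, vector) pairs in a list."""

    def __init__(self) -> None:
        self._items: List[Tuple[str, List[float]]] = []

    def add(self, documents: List[str]) -> None:
        # Documents -> embeddings -> store
        for doc in documents:
            self._items.append((doc, toy_embed(doc)))

    def search(self, query: str, k: int = 2) -> List[Tuple[str, float]]:
        # Query -> embedding -> similarity search -> top-k context
        q = toy_embed(query)
        scored = [(doc, sum(a * b for a, b in zip(q, vec))) for doc, vec in self._items]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]


store = InMemoryVectorStore()
store.add(["python is a programming language", "coffee is a popular beverage"])
retrieved_context = store.search("python is a programming language", k=1)
print(retrieved_context)
```

In a real application each piece is swapped for a production component: the embedding function for a hosted or local model, and the list for a vector database with an index.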
Traditional Search (Without RAG) 🔍
- Matches exact keywords
- Misses synonyms and related concepts
- No understanding of context
- Query: "car" ≠ "automobile"
RAG Vector Search with LangChain 🧠
- Understands meaning and intent
- Finds related concepts
- Context-aware matching
- Query: "car" ≈ "automobile" ≈ "vehicle"
Creating Embeddings with LangChain for RAG Applications
Basic Embedding Generation
The example below creates embeddings for retrieval augmented generation with Google's Gemini embedding model, called directly through the google.generativeai SDK:
📦 Required Dependencies:
Before running this code, install the required packages: google-generativeai, numpy, and python-dotenv.

    import os
    from typing import List

    import google.generativeai as genai
    import numpy as np
    from dotenv import load_dotenv

    # Read GOOGLE_API_KEY from a local .env file
    load_dotenv()

    # Configure Gemini
    genai.configure(api_key=os.getenv("GOOGLE_API_KEY"))


    def get_embedding(text: str, model_name: str = "models/embedding-001") -> List[float]:
        """Get embedding for a single text using Gemini"""
        result = genai.embed_content(
            model=model_name,
            content=text,
            task_type="retrieval_document",  # or "retrieval_query" for queries
            title="Embedding",  # a title is only meaningful for retrieval_document
        )
        return result['embedding']


    def get_embeddings_batch(texts: List[str], model_name: str = "models/embedding-001") -> List[List[float]]:
        """Get embeddings for multiple texts (one request per text)"""
        embeddings = []
        # Gemini supports batch embedding, but this loop sends one request
        # per text to keep the example simple
        for text in texts:
            result = genai.embed_content(
                model=model_name,
                content=text,
                task_type="retrieval_document",
            )
            embeddings.append(result['embedding'])
        return embeddings


    # Example usage
    text = "Machine learning is a subset of artificial intelligence"
    embedding = get_embedding(text)
    print(f"Text: {text}")
    print(f"Embedding dimension: {len(embedding)}")
    print(f"First 10 values: {embedding[:10]}")

    # Batch processing
    texts = [
        "Python is a programming language",
        "Python is a type of snake",
        "JavaScript is used for web development",
        "Coffee is a popular beverage",
    ]
    embeddings = get_embeddings_batch(texts)
    print(f"\nGenerated {len(embeddings)} embeddings")

📖 Code Explanation:
- get_embedding(): Generates a single embedding vector for text using Google's Gemini model
- model_name: "models/embedding-001" is Google's general-purpose embedding model
- task_type: "retrieval_document" for documents, "retrieval_query" for search queries
- Embedding dimension: 768 dimensions capture semantic meaning in mathematical space
- Batch processing: get_embeddings_batch() embeds a list of texts by looping over them, one API call per text; sending several texts per request reduces the number of calls
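The batching idea can be sketched independently of any provider. In the snippet below, `chunked` and `embed_in_batches` are hypothetical helpers, and `embed_many` is an assumed callable that takes a list of texts and returns one vector per text; the batch size of 100 is a placeholder, not a documented limit.

```python
from typing import Callable, Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def embed_in_batches(
    texts: Sequence[str],
    embed_many: Callable[[List[str]], List[List[float]]],
    batch_size: int = 100,  # placeholder; check your provider's limit
) -> List[List[float]]:
    """Call `embed_many` once per batch instead of once per text."""
    vectors: List[List[float]] = []
    for batch in chunked(texts, batch_size):
        vectors.extend(embed_many(batch))
    return vectors


# Demo with a fake embedder that records how often it is called.
calls = []


def fake_embed_many(batch: List[str]) -> List[List[float]]:
    calls.append(len(batch))
    return [[float(len(text))] for text in batch]


demo_vectors = embed_in_batches(["a", "bb", "ccc", "dddd", "eeeee"], fake_embed_many, batch_size=2)
print(len(demo_vectors), "vectors from", len(calls), "calls")
```

Passing the embedder in as a parameter keeps the helper testable and lets the same loop wrap whichever SDK call you end up using.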
🎯 Understanding Embeddings:
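What are those 768 numbers?
Together, the 768 values place the text at a point in vector space. Individual dimensions are learned by the model and are not directly interpretable as topic, tone, or style, but texts with similar meaning end up with similar vectors.
Why normalize embeddings?
For unit-length vectors, cosine similarity reduces to a plain dot product, which is cheaper to compute and easier to compare across documents. Many embedding APIs return vectors that are already close to unit length, but check the norm before relying on it.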
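A small standard-library sketch of that point: after scaling two vectors to unit length, their dot product equals their cosine similarity. The helper names and the two sample vectors are made up for illustration.

```python
import math
from typing import List


def normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        return list(vec)
    return [x / norm for x in vec]


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity computed the long way, with both norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0]
b = [4.0, 3.0]

full_cosine = cosine(a, b)
dot_of_normalized = dot(normalize(a), normalize(b))
print(full_cosine, dot_of_normalized)  # the two values agree up to rounding
```

Normalizing once at indexing time means every later query needs only dot products.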
💡 Expected Output:

    Text: Machine learning is a subset of artificial intelligence
    Embedding dimension: 768
    First 10 values: [-0.0123, 0.0456, -0.0789, 0.0234, ...]

    Generated 4 embeddings

⚠️ Common Issues:
- API Key: Ensure GOOGLE_API_KEY is set in your .env file
- Rate limits: The free tier is rate limited, so retry failed calls with exponential backoff
- Text length: Input limits vary by model, so check the model's documentation (the ~8000-token figure does not apply to every model)
- Language support: Gemini embeddings support 100+ languages
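The exponential backoff mentioned above could look roughly like this. `with_backoff` is a hypothetical helper, not part of any SDK, and the retry count and delays are arbitrary defaults. It catches every exception for brevity; in real code, catch only your client's rate-limit error.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_backoff(
    call: Callable[[], T],
    max_retries: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying with exponentially growing delays on failure."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except Exception:  # assumption: narrow this to the rate-limit error
            if attempt == max_retries:
                raise  # out of retries, surface the last error
            # Delay doubles each attempt: base, 2*base, 4*base, ... capped
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Up to 10% jitter so parallel clients do not retry in lockstep
            sleep(delay + random.uniform(0, delay * 0.1))
    raise RuntimeError("unreachable")  # loop always returns or raises


# Usage sketch (get_embedding is the function defined earlier):
# embedding = with_backoff(lambda: get_embedding("some text"))
```

The `sleep` parameter exists only so the waiting can be replaced in tests.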
Choosing Embedding Models
📊 Embedding Models for RAG Vector Stores (Chroma, Pinecone, Weaviate):
Model | Dimensions | Use Case | Cost |
---|---|---|---|
embedding-001 | 768 | General purpose, multilingual | Free tier available |
text-embedding-preview-0409 | 768 | Latest model, improved quality | Free tier available |
Alternative: Sentence Transformers | 384-1024 | Open source, self-hosted | Free |

    # Alternative: Using Sentence Transformers (open source)
    from sentence_transformers import SentenceTransformer

    # Compare embedding approaches
    # (uses get_embedding and np from the previous section)
    text = "The quick brown fox jumps over the lazy dog"

    # Gemini embedding
    gemini_embedding = get_embedding(text, model_name="models/embedding-001")

    # Load a pre-trained model
    st_model = SentenceTransformer('all-MiniLM-L6-v2')
    st_embedding = st_model.encode(text)

    # Compare results
    comparison = {
        "Gemini": {
            "dimensions": len(gemini_embedding),
            "sample_values": gemini_embedding[:5],
            "magnitude": np.linalg.norm(gemini_embedding),
        },
        "Sentence-Transformers": {
            "dimensions": len(st_embedding),
            "sample_values": st_embedding[:5].tolist(),
            "magnitude": np.linalg.norm(st_embedding),
        },
    }

    # Display comparison
    for model, info in comparison.items():
        print(f"\nModel: {model}")
        print(f"Dimensions: {info['dimensions']}")
        print(f"Magnitude: {info['magnitude']:.4f}")
        print(f"Sample: {[f'{v:.4f}' for v in info['sample_values']]}")

RAG Vector Search: Similarity Calculations for Retrieval
Vector Search for RAG: Choosing Similarity Metrics
🎯 Cosine Similarity for RAG
Standard for LangChain vector stores (Chroma, Pinecone, Weaviate)
- Default in most RAG vector databases
- Optimal for retrieval augmented generation
- Best for: RAG applications, semantic search
📏 Alternative: Euclidean Distance
Used in FAISS and specialized vector databases
- Supported by FAISS for large-scale RAG
- Alternative for specific use cases
- Less common in standard RAG pipelines
📦 Required Dependencies:
Before running this similarity calculation code, install the required packages: numpy, google-generativeai, and python-dotenv.

    import os
    from typing import List, Tuple

    import google.generativeai as genai
    import numpy as np
    from dotenv import load_dotenv

    load_dotenv()
    genai.configure(api_key=os.getenv("GOOGLE_API_KEY"))


    def calculate_cosine_similarity(vec1: List[float], vec2: List[float]) -> float:
        """Calculate cosine similarity between two vectors"""
        vec1 = np.array(vec1)
        vec2 = np.array(vec2)
        # Manual calculation: dot product divided by the product of magnitudes
        dot_product = np.dot(vec1, vec2)
        magnitude1 = np.linalg.norm(vec1)
        magnitude2 = np.linalg.norm(vec2)
        # A zero vector has no direction, so treat it as dissimilar
        if magnitude1 == 0 or magnitude2 == 0:
            return 0.0
        return float(dot_product / (magnitude1 * magnitude2))


    def calculate_euclidean_distance(vec1: List[float], vec2: List[float]) -> float:
        """Calculate Euclidean distance between two vectors"""
        return float(np.linalg.norm(np.array(vec1) - np.array(vec2)))


    def find_most_similar(
        query_embedding: List[float],
        embeddings: List[List[float]],
        texts: List[str],
        method: str = "cosine",
    ) -> List[Tuple[str, float]]:
        """Find most similar texts to query"""
        similarities = []
        for i, embedding in enumerate(embeddings):
            if method == "cosine":
                similarity = calculate_cosine_similarity(query_embedding, embedding)
            else:  # euclidean
                # Convert distance to similarity (inverse)
                distance = calculate_euclidean_distance(query_embedding, embedding)
                similarity = 1 / (1 + distance)  # Convert to 0-1 range
            similarities.append((texts[i], similarity))
        # Sort by similarity (descending)
        similarities.sort(key=lambda x: x[1], reverse=True)
        return similarities


    # The embedding helpers are repeated from the previous section so this
    # block runs on its own
    def get_embedding(
        text: str,
        model_name: str = "models/embedding-001",
        task_type: str = "retrieval_document",
    ) -> List[float]:
        """Get embedding for a single text using Gemini"""
        result = genai.embed_content(
            model=model_name,
            content=text,
            task_type=task_type,
        )
        return result['embedding']


    def get_embeddings_batch(texts: List[str]) -> List[List[float]]:
        """Get embeddings for multiple texts (one request per text)"""
        return [get_embedding(text) for text in texts]


    # Example: How to create embeddings for RAG applications
    # These documents would typically come from your knowledge base
    documents = [
        "The cat sat on the mat",
        "A feline rested on the rug",
        "The dog played in the park",
        "Machine learning is fascinating",
        "Deep learning is a subset of ML",
    ]

    # Create embeddings for your vector database (Chroma, Pinecone, etc.)
    embeddings = get_embeddings_batch(documents)

    # User query for retrieval augmented generation
    # (queries use the "retrieval_query" task type)
    query = "A kitten was sleeping on the carpet"
    query_embedding = get_embedding(query, task_type="retrieval_query")

    # RAG vector search - find similar documents for context
    cosine_results = find_most_similar(query_embedding, embeddings, documents, method="cosine")
    euclidean_results = find_most_similar(query_embedding, embeddings, documents, method="euclidean")

    print("Query:", query)
    print("\nCosine Similarity Results:")
    for text, score in cosine_results[:3]:
        print(f" {score:.4f}: {text}")
    print("\nEuclidean Distance Results:")
    for text, score in euclidean_results[:3]:
        print(f" {score:.4f}: {text}")

📖 Code Explanation:
- Cosine Similarity: Measures angle between vectors (dot product / magnitude product)
- Range [−1, 1]: 1 = identical direction, 0 = perpendicular, −1 = opposite
- Euclidean Distance: Straight-line distance in n-dimensional space
- find_most_similar(): Ranks all texts by similarity to query
- Distance to similarity: Using 1/(1+distance) to convert distance to 0-1 range
🎯 When to Use Which Metric:
Cosine Similarity
- Text similarity (most common)
- When magnitude doesn't matter
- Normalized embeddings
- High-dimensional data
Euclidean Distance
- Clustering algorithms
- When magnitude matters
- Low-dimensional data
- Anomaly detection
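For normalized embeddings the two metrics are more closely related than the lists suggest. A short derivation, assuming both vectors have unit length and writing θ for the angle between them:

```latex
\begin{aligned}
\lVert a - b \rVert^2
  &= \lVert a \rVert^2 + \lVert b \rVert^2 - 2\, a \cdot b \\
  &= 2 - 2\cos\theta
  \qquad \text{when } \lVert a \rVert = \lVert b \rVert = 1
\end{aligned}
```

Squared distance falls as cosine similarity rises, so on unit-length vectors both metrics rank neighbors in the same order. The choice matters mainly when vector magnitudes differ.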
💡 Expected Output:
Exact scores depend on the model and will differ from these illustrative values.

    Query: A kitten was sleeping on the carpet

    Cosine Similarity Results:
     0.8923: A feline rested on the rug
     0.8156: The cat sat on the mat
     0.6234: The dog played in the park

    Euclidean Distance Results:
     0.0412: A feline rested on the rug
     0.0389: The cat sat on the mat
     0.0234: The dog played in the park

⚠️ RAG Performance Tips:
- Use LangChain vector stores: Chroma for local, Pinecone for cloud-scale RAG
- FAISS for production: Facebook AI's library for billion-scale vector search
- Weaviate for hybrid: Combines vector and keyword search in RAG pipelines
- Batch processing: Create embeddings in batches before storing
Visualizing High-Dimensional Embeddings
Dimensionality Reduction for Visualization
Since embeddings are high-dimensional (768 dimensions for the Gemini model used here, 1536 or more for some other models), we need to reduce them to 2D or 3D for visualization:
📦 Required Dependencies:
Before running this visualization code, install the required packages: scikit-learn, matplotlib, numpy, google-generativeai, and python-dotenv.

    import os
    from typing import List

    import google.generativeai as genai
    import matplotlib.pyplot as plt
    import numpy as np
    from dotenv import load_dotenv
    from sklearn.decomposition import PCA
    from sklearn.manifold import TSNE

    load_dotenv()
    genai.configure(api_key=os.getenv("GOOGLE_API_KEY"))


    # Helpers repeated from earlier sections so this block runs on its own
    def get_embedding(text: str, model_name: str = "models/embedding-001") -> List[float]:
        """Get embedding for a single text using Gemini"""
        result = genai.embed_content(
            model=model_name,
            content=text,
            task_type="retrieval_document",
        )
        return result['embedding']


    def get_embeddings_batch(texts: List[str]) -> List[List[float]]:
        """Get embeddings for multiple texts (one request per text)"""
        return [get_embedding(text) for text in texts]


    def calculate_cosine_similarity(vec1: List[float], vec2: List[float]) -> float:
        """Calculate cosine similarity between two vectors"""
        vec1 = np.array(vec1)
        vec2 = np.array(vec2)
        dot_product = np.dot(vec1, vec2)
        magnitude1 = np.linalg.norm(vec1)
        magnitude2 = np.linalg.norm(vec2)
        if magnitude1 == 0 or magnitude2 == 0:
            return 0.0
        return float(dot_product / (magnitude1 * magnitude2))


    def reduce_dimensions(embeddings: List[List[float]], method: str = "pca") -> np.ndarray:
        """Reduce embeddings to 2D for visualization"""
        embeddings_array = np.array(embeddings)
        if method == "pca":
            reducer = PCA(n_components=2, random_state=42)
        else:  # t-SNE
            # Perplexity must be smaller than the number of samples
            reducer = TSNE(n_components=2, random_state=42, perplexity=min(30, len(embeddings) - 1))
        return reducer.fit_transform(embeddings_array)


    def visualize_embeddings(embeddings: List[List[float]], texts: List[str], title: str = "Embedding Visualization"):
        """Create 2D visualization of embeddings"""
        # Reduce dimensions
        reduced = reduce_dimensions(embeddings, method="pca")
        # Create scatter plot
        plt.figure(figsize=(10, 8))
        plt.scatter(reduced[:, 0], reduced[:, 1], alpha=0.6, s=100)
        # Add labels, truncating long texts
        for i, text in enumerate(texts):
            label = (text[:30] + "...") if len(text) > 30 else text
            plt.annotate(label,
                         xy=(reduced[i, 0], reduced[i, 1]),
                         xytext=(5, 5), textcoords='offset points',
                         fontsize=9, alpha=0.7)
        plt.title(title)
        plt.xlabel("First Principal Component")
        plt.ylabel("Second Principal Component")
        plt.grid(True, alpha=0.3)
        plt.tight_layout()
        plt.show()


    # Example: Visualize semantic clusters
    categories = {
        "Animals": ["cat", "dog", "elephant", "tiger", "mouse"],
        "Food": ["pizza", "burger", "salad", "soup", "pasta"],
        "Technology": ["computer", "smartphone", "AI", "robot", "software"],
        "Sports": ["football", "basketball", "tennis", "swimming", "running"],
    }
    all_texts = []
    all_categories = []
    for category, items in categories.items():
        all_texts.extend(items)
        all_categories.extend([category] * len(items))

    # Get embeddings
    embeddings = get_embeddings_batch(all_texts)

    # Visualize
    visualize_embeddings(embeddings, all_texts, "Semantic Clusters Visualization")


    # Calculate intra-category similarities
    def analyze_category_similarities(embeddings, categories, texts):
        """Analyze average pairwise similarity within each category"""
        results = {}
        for cat1 in categories:
            cat1_indices = [i for i, t in enumerate(texts) if t in categories[cat1]]
            cat1_embeddings = [embeddings[i] for i in cat1_indices]
            # Intra-category similarity (within same category)
            intra_sims = []
            for i in range(len(cat1_embeddings)):
                for j in range(i + 1, len(cat1_embeddings)):
                    sim = calculate_cosine_similarity(cat1_embeddings[i], cat1_embeddings[j])
                    intra_sims.append(sim)
            avg_intra = np.mean(intra_sims) if intra_sims else 0
            results[cat1] = {"intra_similarity": avg_intra}
        print("Category Cohesion Analysis:")
        print("-" * 40)
        for category, metrics in results.items():
            print(f"{category}: {metrics['intra_similarity']:.3f}")
        return results


    analyze_category_similarities(embeddings, categories, all_texts)

📖 Code Explanation:
- PCA (Principal Component Analysis): Linear dimensionality reduction, preserves global structure
- t-SNE: Non-linear reduction, better for visualizing clusters but slower
- reduce_dimensions(): Converts 768D vectors to 2D points for plotting
- analyze_category_similarities(): Quantifies how well categories cluster
- Intra-category similarity: Average similarity within each category (higher = better clustering)
🎯 Visualization Insights:
Expected Clustering Pattern:
Animals cluster together, separate from Food items. Technology and Sports form their own regions. Items within categories are closer than items across categories.
Category Cohesion Scores:
Animals: ~0.85 | Food: ~0.82 | Technology: ~0.87 | Sports: ~0.84 (Higher scores indicate tighter semantic grouping)
💡 What to Look For:
- Similar items cluster together in the visualization
- Categories form distinct regions in the embedding space
- Higher intra-category similarity indicates better semantic grouping
- Distance in the plot roughly corresponds to semantic distance
⚠️ Visualization Limitations:
- Information loss: Projecting 768D to 2D discards most of the variance; for PCA, check explained_variance_ratio_ to see how much the two components retain
- t-SNE distances: Only local distances are meaningful, not global
- Random initialization: t-SNE plots vary between runs unless random_state is fixed
- Perplexity parameter: Affects cluster appearance in t-SNE
✨ Embedding Best Practices
Model Selection
- Start with smaller models for prototyping
- Use larger models for production accuracy
- Consider multilingual models for global apps
- Cache embeddings to reduce API costs
Performance Tips
- Batch API requests where the provider allows it (limits vary by provider; the 2048-input figure applies to OpenAI's embeddings endpoint)
- Normalize embeddings for faster similarity
- Use approximate nearest neighbors for scale
- Pre-compute embeddings for static content
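Caching and pre-computing can be as simple as keying stored vectors by a hash of the text. The sketch below is one possible shape, not a library API: `EmbeddingCache` is a hypothetical class, `embed_fn` is whatever embedding function you use, and the in-memory dict would be a file or database in a real system.

```python
import hashlib
from typing import Callable, Dict, List


class EmbeddingCache:
    """Hypothetical cache: embeds each distinct text at most once."""

    def __init__(self, embed_fn: Callable[[str], List[float]]) -> None:
        self._embed_fn = embed_fn
        self._store: Dict[str, List[float]] = {}
        self.misses = 0  # number of times the embedder was actually called

    @staticmethod
    def _key(text: str) -> str:
        # Hash the content so keys stay short even for long documents
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def get(self, text: str) -> List[float]:
        key = self._key(text)
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed_fn(text)
        return self._store[key]


# Demo with a stand-in embedder; swap in a real embedding call.
cache = EmbeddingCache(lambda text: [float(len(text))])
first = cache.get("static help page")
second = cache.get("static help page")  # served from the cache
print(cache.misses)
```

If you change the embedding model, clear the cache or include the model name in the key, since vectors from different models are not comparable.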
🔧 Common Embedding Patterns
🔍 Semantic Search
Find documents by meaning, not keywords
Query → Embedding → Similarity → Results
🎯 Clustering
Group similar documents automatically
Embeddings → K-means → Clusters
🤖 Classification
Categorize text using embeddings
Embedding → Classifier → Category
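As one illustration of the classification pattern, a nearest-centroid classifier averages the embeddings of each category and assigns new text to the closest average. This is only one possible classifier, and the 2-D vectors below are hand-made stand-ins for real embeddings.

```python
import math
from typing import Dict, List


def centroid(vectors: List[List[float]]) -> List[float]:
    """Component-wise mean of a list of equal-length vectors."""
    dims = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(vec: List[float], centroids: Dict[str, List[float]]) -> str:
    """Return the label whose centroid is most similar to `vec`."""
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Hand-made toy "embeddings" per category (not model output)
training = {
    "animals": [[0.9, 0.1], [0.8, 0.2]],
    "food": [[0.1, 0.9], [0.2, 0.8]],
}
centroids = {label: centroid(vectors) for label, vectors in training.items()}

predicted = classify([0.7, 0.3], centroids)
print(predicted)
```

With real embeddings, the same structure works unchanged; only the vectors get longer.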
🎉 Next Steps
You've mastered how to create embeddings for retrieval augmented generation! Next, implement LangChain vector stores with Chroma, Pinecone, Weaviate and FAISS to build production RAG applications with efficient vector search and retrieval.