Elasticsearch vs Vector Search — A Data Engineer's Guide

TL;DR: Keyword search isn't dead, it's just unsexy. This project demonstrates when to use Elasticsearch, when to use vector search (pgvector), and when to combine both for production search systems.


What This Project Demonstrates

This hands-on project validates every key insight from the blog post with working code:

  1. When keyword search wins: Exact IDs, error codes, SKUs, logs, and metrics
  2. When vector search wins: Vague queries, semantic understanding, conceptual similarity
  3. The hybrid approach: Filter 1M docs to 1K with keywords, then vector search the rest
  4. Cost implications: Embedding pipeline overhead, RAM requirements, query latency
  5. Decision framework: A practical guide for choosing the right approach

Architecture

┌─────────────────────┐         ┌──────────────────────┐
│   Elasticsearch     │         │  PostgreSQL+pgvector │
│  (Keyword Search)   │         │  (Semantic Search)   │
├─────────────────────┤         ├──────────────────────┤
│ • Exact matches     │         │ • Embeddings (384d)  │
│ • Boolean logic     │         │ • Cosine similarity  │
│ • Fuzzy search      │         │ • Semantic ranking   │
│ • Aggregations      │         │ • Conceptual search  │
│ • Fast filtering    │         │ • Vague queries      │
└─────────────────────┘         └──────────────────────┘
         │                               │
         └───────────┬───────────────────┘
                     │
              ┌──────▼────────┐
              │ Hybrid Search │
              ├───────────────┤
              │ 1. ES Filter  │
              │ 2. Vector Rank│
              └───────────────┘

Tech Stack:

  • Elasticsearch 8.11: Keyword search, filtering, aggregations
  • PostgreSQL 16 + pgvector: Vector storage and similarity search
  • sentence-transformers: Embedding generation (all-MiniLM-L6-v2, 384 dimensions)
  • Python: Search implementations and demos
  • Docker Compose: One-command deployment

Quick Start

Prerequisites

  • Docker & Docker Compose
  • 4GB+ RAM available
  • ~2GB disk space

1. Start the Stack

cd elasticsearch-vs-vector-search
docker compose up -d

Wait for services to be healthy (~30 seconds):

docker compose ps

You should see all services healthy:

  • elastic-search (port 9200)
  • postgres-vector (port 5432)
  • search-demo-api (container for running scripts)

2. Generate and Load Data

# Generate 1,000 sample products
docker compose exec demo-api python scripts/generate_data.py

# Load into Elasticsearch and pgvector (this shows the cost difference!)
docker compose exec demo-api python scripts/load_data.py

Watch the output — you'll see:

  • Elasticsearch indexes 1,000 docs in ~1-2 seconds
  • Vector pipeline (embedding + insert) takes 5-10x longer
  • This validates: "Every data change triggers re-embedding"
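The re-embedding cost can be sketched with a toy catalog (a hypothetical class, not part of this repo's scripts): every upsert, including a one-word description edit, pays the full embedding-inference cost again.

```python
# Hypothetical sketch (not from this repo): every write re-runs the
# embedding model, so frequent updates multiply pipeline cost.
class ToyCatalog:
    def __init__(self, embed_fn):
        self.embed_fn = embed_fn   # stands in for e.g. SentenceTransformer.encode
        self.docs = {}
        self.embeddings = {}
        self.embed_calls = 0       # counts how often we pay for inference

    def upsert(self, doc_id, text):
        # Even a tiny description edit invalidates the stored vector.
        self.docs[doc_id] = text
        self.embed_calls += 1
        self.embeddings[doc_id] = self.embed_fn(text)

catalog = ToyCatalog(embed_fn=lambda text: [float(len(text))])  # dummy "model"
catalog.upsert("ELEC-000042", "Wireless mouse")
catalog.upsert("ELEC-000042", "Wireless mouse, ergonomic")  # update -> re-embed
print(catalog.embed_calls)  # 2
```

Elasticsearch, by contrast, just re-tokenizes the changed field on reindex, which is why its load path stays in the seconds range.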

3. Run the Comprehensive Demo

docker compose exec demo-api python scripts/demo.py

This interactive demo walks through all scenarios from the blog:

  • Exact matches (where keyword wins)
  • Semantic queries (where vector wins)
  • Cost comparisons
  • Hybrid approach
  • Decision framework

Individual Demos

Keyword Search Demo

docker compose exec demo-api python scripts/keyword_search.py

Demonstrates:

  • ✅ Exact SKU/error code lookup (milliseconds)
  • ✅ Boolean logic (AND, OR, NOT)
  • ✅ Fuzzy search for typos
  • ✅ Fast filtering and aggregations

Key Insight: "If users type exact IDs, error codes, or SKUs → vector search is expensive theater"

Vector Search Demo

docker compose exec demo-api python scripts/vector_search.py

Demonstrates:

  • ✅ Semantic understanding of vague queries
  • ✅ Conceptual similarity across categories
  • ⚠️ Embedding overhead on every query
  • ⚠️ Cannot do boolean NOT natively

Key Insight: "Cosine similarity doesn't understand NOT"
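To make the NOT limitation concrete, here is plain-Python cosine similarity, the same measure pgvector's cosine operator is built on. A similarity score only says how close two vectors are; no score value means "about X but NOT Y", which is why exclusions need a separate filtering step.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Identical direction -> 1.0; orthogonal -> 0.0. Nothing in between
# expresses "must NOT contain gaming" -- that's a boolean predicate,
# not a distance.
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```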

Hybrid Search Demo

docker compose exec demo-api python scripts/hybrid_search.py

Demonstrates:

  • ✅ Filter with Elasticsearch (1M → 1K docs)
  • ✅ Rank with vectors (semantic relevance)
  • ✅ Best of both worlds
  • ✅ Production-ready approach

Key Insight: "Hybrid search really means: filter 1M docs to 1K with keywords, then vector search the rest"
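Stripped of infrastructure, filter-then-rank is only a few lines. This is a simplified in-memory sketch; in the repo, Elasticsearch does step 1 and pgvector does step 2.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def hybrid_search(docs, predicate, query_vec, top_k=3):
    # Step 1: cheap keyword/metadata filter (Elasticsearch's job)
    candidates = [d for d in docs if predicate(d)]
    # Step 2: semantic ranking over the small candidate set (pgvector's job)
    candidates.sort(key=lambda d: cosine(query_vec, d["embedding"]), reverse=True)
    return candidates[:top_k]

docs = [
    {"name": "wireless earbuds", "price": 79, "embedding": [0.9, 0.1]},
    {"name": "wired headphones", "price": 25, "embedding": [0.8, 0.3]},
    {"name": "wireless speaker", "price": 199, "embedding": [0.7, 0.6]},
]
results = hybrid_search(docs, lambda d: d["price"] < 100, query_vec=[1.0, 0.0])
print([d["name"] for d in results])  # cheap items only, semantically ranked
```

The expensive step (vector comparison) only ever touches the documents that survived the cheap step, which is where the "1M to 1K" economics come from.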


Project Structure

elasticsearch-vs-vector-search/
├── docker-compose.yml          # Infrastructure setup
├── Dockerfile                  # Python environment
├── requirements.txt            # Python dependencies
├── README.md                   # This file
│
├── config/
│   └── init.sql               # PostgreSQL schema with pgvector
│
├── scripts/
│   ├── generate_data.py       # Generate sample product data
│   ├── load_data.py           # Load into ES + pgvector
│   ├── keyword_search.py      # Elasticsearch search module
│   ├── vector_search.py       # pgvector semantic search
│   ├── hybrid_search.py       # Combined approach
│   └── demo.py                # Comprehensive demo
│
└── data/
    └── products.json          # Generated sample data

Sample Data

The project generates a realistic product catalog with:

  • 1,000 products across 4 categories:

    • Electronics (wireless mouse, keyboard, webcam, etc.)
    • Office supplies (desk, chair, organizer, etc.)
    • Home goods (coffee maker, blender, vacuum, etc.)
    • Sports & fitness (yoga mat, dumbbells, treadmill, etc.)
  • Realistic attributes:

    • Unique SKUs (e.g., ELEC-000042)
    • Product names and descriptions
    • Prices ($9.99 - $999.99)
    • Stock quantities
    • Error codes (10% of products — for demonstrating exact match scenarios)

This data perfectly demonstrates when to use each search approach.


Key Scenarios Demonstrated

✅ Scenario 1: Exact ID Lookup

Use Case: Customer service rep has a product SKU
Best Approach: Keyword search
Why: Instant exact match, no embeddings needed

# Elasticsearch: <5ms
result = keyword_searcher.search_by_sku("ELEC-000042")

✅ Scenario 2: Error Code Search

Use Case: DevOps searching logs for error code ERR-1001
Best Approach: Keyword search
Why: Structured data, exact match, fast aggregations

# Elasticsearch: <5ms, with aggregations
result = keyword_searcher.search_by_error_code("ERR-1001")

✅ Scenario 3: Boolean Logic

Use Case: Find wireless products BUT NOT gaming
Best Approach: Keyword search
Why: Vector search can't do NOT natively

# Elasticsearch: native boolean support
result = keyword_searcher.search_with_boolean_logic(
    must_have=["wireless"],
    must_not_have=["gaming"]
)

✅ Scenario 4: Vague Conceptual Query

Use Case: "something to help me work from home comfortably"
Best Approach: Vector search
Why: User describes intent, not exact terms

# pgvector: semantic understanding
result = vector_searcher.semantic_search(
    "something to help me work from home comfortably"
)

✅ Scenario 5: Production E-commerce Search

Use Case: "wireless audio under $100 in stock"
Best Approach: Hybrid search
Why: Filter millions → thousands, then semantic rank

# Hybrid: ES filter → vector rank
result = hybrid_searcher.hybrid_search(
    query="wireless audio",
    max_price=100,
    in_stock_only=True
)

Performance Comparison

From the actual demo (1,000 products):

| Operation | Keyword Search | Vector Search | Notes |
|---|---|---|---|
| Data Loading | ~1-2 sec | ~10-20 sec | Vector requires embedding generation |
| Exact Match | <5ms | ~50-100ms | Embedding overhead is wasteful |
| Text Query | ~10-20ms | ~50-150ms | Vector adds embedding + compute time |
| Filtered Query | ~5-10ms | ~30-80ms | ES excels at filtering |
| Hybrid Search | N/A | ~30-100ms | Best of both: fast filter + semantic rank |

Scaling Impact (from blog insights):

| Dataset Size | ES Index Time | Vector Pipeline Time | Vector RAM |
|---|---|---|---|
| 1K products | ~1-2 sec | ~10-20 sec | ~1.5 MB |
| 10K products | ~10-20 sec | ~100-200 sec | ~15 MB |
| 100K products | ~1-2 min | ~15-30 min | ~150 MB |
| 1M products | ~10-20 min | ~2-5 hours | ~1.5 GB |
| 10M products | ~2-3 hours | ~20-50 hours | ~15 GB |

Note: Vector times include embedding generation (the "expensive pipeline" mentioned in the blog)
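The RAM column is simple arithmetic (assuming 4-byte float32 components and ignoring ANN index overhead, which adds more on top):

```python
def vector_ram_bytes(n_docs, dims=384, bytes_per_float=4):
    """Raw embedding storage; real indexes (e.g. HNSW) add extra overhead."""
    return n_docs * dims * bytes_per_float

print(vector_ram_bytes(1_000) / 1e6)             # ~1.5 MB for 1K products at 384d
print(vector_ram_bytes(1_000_000) / 1e9)         # ~1.5 GB for 1M products
print(vector_ram_bytes(1_000, dims=1536) / 1e6)  # 4x that at 1536 dimensions
```

This is also the math behind the "768d vs 1536d can double your infra cost" insight below: RAM scales linearly with dimension count.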


Blog Insights Validated

1. "Keyword search isn't dead, it's just unsexy"

Validated: Exact matches, error codes, and structured data searches are instant with Elasticsearch

2. "Vector search is expensive theater for exact matches"

Validated: Loading demo shows 5-10x time difference; exact SKU search wastes embedding inference

3. "Every data change triggers re-embedding"

Validated: Update a product description = regenerate embeddings = pipeline latency + cost

4. "pgvector wins for most teams"

Validated: SQL + vectors in one database is simpler than managing Elastic + Pinecone separately

5. "Hybrid search really means: filter 1M to 1K, then vector the rest"

Validated: Demo shows ES filter (5ms) → vector rank (50ms) = best performance + relevance

6. "768d vs 1536d can double your infra cost"

Validated: Our 384d embeddings = 1.5MB per 1K docs; 1536d would be 4x that = 6MB

7. "Elasticsearch indexes in seconds, vector DBs take minutes"

Validated: for 1,000 docs, ES indexes in ~1-2 sec while the pgvector pipeline takes ~10-20 sec (roughly 10x slower)

8. "Cosine similarity doesn't understand NOT"

Validated: Vector search demo requires manual filtering; ES has native boolean NOT

9. "Fix typos before adding embeddings"

Validated: ES fuzzy matching handles typos instantly without costly embeddings
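Fuzzy matching rests on edit distance: Elasticsearch's fuzziness parameter bounds how many single-character edits may separate the query from an indexed term. A minimal Levenshtein implementation shows why a typo like "wirelss" still finds "wireless":

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

# One edit away: fuzziness=1 (or AUTO) matches it without any embeddings.
print(levenshtein("wirelss", "wireless"))  # 1
```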

10. "The best search stack is boring"

Validated: Elastic for filters, vectors only when meaning matters = production-ready


Decision Framework

Use this simple flowchart:

┌─────────────────────────────────────┐
│ Do users type exact IDs/codes/SKUs? │
└─────────────────┬───────────────────┘
            Yes   │   No
      ┌───────────┴───────────┐
      ▼                       ▼
┌────────────────┐   ┌──────────────────────┐
│ KEYWORD SEARCH │   │ Is it logs/metrics/  │
│ (Elasticsearch)│   │ time-series data?    │
└────────────────┘   └──────────┬───────────┘
                           Yes  │  No
                  ┌─────────────┴─────────────┐
                  ▼                           ▼
         ┌────────────────┐      ┌────────────────────────┐
         │ KEYWORD SEARCH │      │ Can users articulate   │
         │ (Elasticsearch)│      │ exact query terms?     │
         └────────────────┘      └───────────┬────────────┘
                                        Yes  │  No
                               ┌─────────────┴────────────┐
                               ▼                          ▼
                  ┌──────────────────────┐        ┌─────────────┐
                  │ Do you need boolean  │        │   VECTOR    │
                  │ logic (NOT, etc.)?   │        │   SEARCH    │
                  └──────────┬───────────┘        │ (pgvector)  │
                        Yes  │  No                └─────────────┘
               ┌─────────────┴────────────┐
               ▼                          ▼
      ┌────────────────┐         ┌──────────────┐
      │ KEYWORD SEARCH │         │    HYBRID    │
      │ (Elasticsearch)│         │    SEARCH    │
      └────────────────┘         │ (ES + vector)│
                                 └──────────────┘

Simple Rule of Thumb:

  • Exact match → Keyword
  • Structured data → Keyword
  • Boolean logic → Keyword
  • Vague/semantic → Vector
  • Large dataset with filters → Hybrid
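The rule of thumb collapses into a small routing function (a hypothetical sketch, not part of the repo; the branch order matters, since exact-match signals win ties):

```python
def recommend_search(exact_ids=False, structured=False, needs_boolean=False,
                     vague_query=False, large_filtered=False):
    """Hypothetical router following the rule of thumb above."""
    if exact_ids or structured or needs_boolean:
        return "keyword"
    if large_filtered:
        return "hybrid"
    if vague_query:
        return "vector"
    return "keyword"  # default: boring and cheap

print(recommend_search(exact_ids=True))                          # keyword
print(recommend_search(vague_query=True))                        # vector
print(recommend_search(vague_query=True, large_filtered=True))   # hybrid
```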

Common Use Cases

| Use Case | Recommended Approach | Example |
|---|---|---|
| E-commerce product search | Hybrid | Filter by category/price, rank by semantic relevance |
| Customer support knowledge base | Hybrid | Filter by category, find semantically similar articles |
| Error log search | Keyword | Exact error codes, fast aggregations |
| Code search | Keyword | Exact function names, boolean logic |
| Document discovery | Vector | Find conceptually similar documents |
| Metrics/time-series | Keyword | Exact timestamps, fast filtering |
| SKU/ID lookup | Keyword | Instant exact match |
| FAQ chatbot | Hybrid | Filter by category, semantic question matching |
| Recommendation engine | Vector | Find similar items based on description |
| Compliance document search | Hybrid | Filter by date/type, semantic content search |

Customization

Use Different Embedding Model

Edit scripts/load_data.py and scripts/vector_search.py:

# Current: all-MiniLM-L6-v2 (384 dimensions)
model = SentenceTransformer("all-MiniLM-L6-v2")

# Option: all-mpnet-base-v2 (768 dimensions, better quality)
model = SentenceTransformer("all-mpnet-base-v2")

# Option: OpenAI embeddings (requires API key)
# Update docker-compose.yml with OPENAI_API_KEY

Remember: Dimension changes require schema updates in config/init.sql!

Add More Data

# Edit generate_data.py to increase count
docker compose exec demo-api python scripts/generate_data.py

# Reload data
docker compose exec demo-api python scripts/load_data.py

Test Your Own Queries

# Interactive Python shell
docker compose exec demo-api python

from keyword_search import KeywordSearch
from vector_search import VectorSearch
from hybrid_search import HybridSearch

# Try your queries
searcher = HybridSearch()
results = searcher.hybrid_search("your query here", category="electronics")

Monitoring & Debugging

Check Elasticsearch

# Cluster health (URLs quoted so the shell doesn't glob the "?")
curl 'http://localhost:9200/_cluster/health?pretty'

# Index stats
curl 'http://localhost:9200/products/_stats?pretty'

# Sample search
curl -X POST 'http://localhost:9200/products/_search?pretty' \
  -H 'Content-Type: application/json' \
  -d '{"query": {"match": {"name": "wireless"}}}'

Check PostgreSQL + pgvector

# Connect to database
docker compose exec postgres psql -U searchuser -d searchdb

-- Check table (run inside psql)
SELECT COUNT(*) FROM products;

-- Check vector extension
SELECT * FROM pg_extension WHERE extname = 'vector';

-- Sample vector search (<=> is cosine distance; 1 - distance = similarity)
SELECT name, 1 - (embedding <=> '[0.1, 0.2, ...]'::vector) AS similarity
FROM products
ORDER BY embedding <=> '[0.1, 0.2, ...]'::vector
LIMIT 5;

View Logs

# All services
docker compose logs -f

# Specific service
docker compose logs -f elasticsearch
docker compose logs -f postgres

Cleanup

# Stop services (keep data)
docker compose stop

# Stop and remove containers (keep data)
docker compose down

# Remove everything including data volumes
docker compose down -v

Costs & Production Considerations

Development (This Demo)

  • Compute: Minimal (runs on laptop)
  • Storage: ~2GB total
  • RAM: ~1GB for ES, ~500MB for PG
  • Embedding: Free (local sentence-transformers)

Production Estimates (1M products)

Option 1: Keyword Only (Elasticsearch)

  • Cost: ~$100-200/month
  • RAM: ~2-4GB
  • Storage: ~10GB
  • Latency: <10ms
  • Pros: Cheap, fast, simple
  • Cons: No semantic understanding

Option 2: Vector Only (e.g., Pinecone)

  • Cost: ~$500-1000/month (1M vectors, 384d)
  • RAM: ~2GB just for vectors
  • Latency: ~50-100ms (includes embedding)
  • Pros: Semantic search
  • Cons: Expensive, can't do boolean logic, slower

Option 3 (Recommended): Hybrid (ES + pgvector)

  • Cost: ~$150-300/month
  • RAM: ~3-5GB total
  • Storage: ~15GB
  • Latency: ~30-80ms (filter + rank)
  • Pros: Best of both, single DB (pgvector), manageable cost
  • Cons: Slightly more complex setup

Embedding Costs (if using OpenAI API):

  • text-embedding-3-small: $0.02 per 1M tokens
  • 1M products × 50 tokens avg = 50M tokens = $1 one-time
  • Re-embedding on updates adds ongoing cost
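The one-time figure above is plain token arithmetic (price taken from the text-embedding-3-small rate quoted above; verify current pricing before budgeting):

```python
def embedding_cost_usd(n_docs, avg_tokens_per_doc, usd_per_million_tokens=0.02):
    """Cost of embedding n_docs at a per-million-token API price."""
    return n_docs * avg_tokens_per_doc / 1_000_000 * usd_per_million_tokens

print(embedding_cost_usd(1_000_000, 50))  # ~$1 one-time backfill
# If, say, 5% of a 1M-product catalog changes per month, re-embedding recurs:
print(embedding_cost_usd(50_000, 50))     # ~$0.05/month at that churn rate
```

The API cost is small at this scale; the dominant costs remain pipeline latency and the RAM to serve the vectors.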


Contributing

Found an issue or have an improvement?

  1. Fork the repository
  2. Create a feature branch
  3. Submit a pull request

License

This project is part of the simple-dataengineering-ai-stack repository.


Key Takeaways

  1. Start with keyword search — it handles 80% of use cases perfectly
  2. Add vector search only when semantic understanding matters
  3. Use hybrid approach for production e-commerce and content platforms
  4. pgvector + PostgreSQL is simpler than separate vector DBs for most teams
  5. Fix data quality first — garbage embeddings lose to well-tuned keyword search
  6. The best search stack is boring — Elastic for filters, vectors for meaning

Happy Searching! 🔍