Skip to content

ThinkWorks/atlas-search-deduplication-demo

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

16 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

MongoDB Atlas Search Customer Deduplication Platform

πŸš€ Enterprise-grade intelligent duplicate detection using MongoDB Atlas Search with advanced fuzzy matching, similarity scoring, and modern web interface.

✨ Features

  • πŸ” Advanced Duplicate Detection - Multi-field fuzzy search with configurable similarity thresholds
  • ⚑ Real-time Performance - Sub-100ms Atlas Search queries with concurrent execution
  • 🎯 Smart Confidence Scoring - 160-point weighted algorithm with business confidence levels
  • πŸ–₯️ Modern Web Interface - Professional customer support dashboard with merge workflows
  • 🎬 Demo Mode - Interactive presentation features with performance metrics and ROI calculations
  • πŸ”§ Configurable Settings - Field scores, fuzzy matching, concurrent search options
  • πŸ›‘οΈ Enterprise Ready - Structured logging, health checks, comprehensive error handling
  • πŸ“Š REST API - Full API with OpenAPI documentation for CRM integration

πŸš€ Quick Start

Prerequisites

  • Python 3.9+
  • MongoDB Atlas cluster with Atlas Search enabled
  • Git

Installation & Setup

# 1. Clone and setup
git clone <repository-url>
cd atlas-search-deduplication-demo

# 2. Configure environment
cp env.example .env
# Edit .env with your MongoDB Atlas connection string

# 3. Install dependencies
pip install -r requirements.txt

# 4. Verify setup
python3 -c "from dotenv import load_dotenv; load_dotenv(); import os; print('βœ… Environment OK' if os.getenv('MONGODB_URI') else '❌ MONGODB_URI missing')"

# 5. Generate sample data
python3 data_generator.py --num-records 10000 --duplicate-pct 0.15

# 6. Create Atlas Search index (see detailed instructions below)

# 7. Start application
./start_app.sh

πŸ“Š Data Generation

Generate Sample Customer Data

The application includes a sophisticated data generator that creates realistic customer records with controlled duplicates:

# Basic data generation (10K records, 15% duplicates)
python3 data_generator.py --num-records 10000 --duplicate-pct 0.15

# Large dataset for performance testing (100K records)
python3 data_generator.py --num-records 100000 --duplicate-pct 0.2

# Small test dataset (1K records, 20% duplicates)
python3 data_generator.py --num-records 1000 --duplicate-pct 0.2

# Replace existing data without prompts
python3 data_generator.py --num-records 5000 --duplicate-pct 0.15 --force

# Add to existing data
python3 data_generator.py --num-records 2000 --duplicate-pct 0.1 --append

Data Generation Options

Option Description Example
--num-records Total number of customer records --num-records 50000
--duplicate-pct Percentage of duplicates (0.1 = 10%) --duplicate-pct 0.25
--drop Drop existing collection without prompt --drop
--append Add to existing collection --append
--force Skip all confirmations --force
--quiet Reduce logging output --quiet

Generated Data Features

  • Realistic Names - US/international names with common variations
  • Email Patterns - Multiple domains with realistic usernames
  • Phone Numbers - Formatted and normalized versions
  • Addresses - Complete postal addresses with variations
  • Controlled Duplicates - Realistic typos and variations:
    • Name variants: "Jon"/"John", "Catherine"/"Katherine"
    • Email typos: "gmail.com"/"gmial.com"
    • Phone formatting: "+1-555-123-4567" vs "(555) 123-4567"
    • Address variations: "Street"/"St", "Avenue"/"Ave"

πŸ” Atlas Search Index Setup

Step 1: Access Atlas Search

  1. Login to MongoDB Atlas at cloud.mongodb.com
  2. Select your cluster
  3. Navigate to Search tab in the cluster view
  4. Click "Create Index"

Step 2: Create Search Index

Option A: Using Atlas UI (Recommended)

  1. Choose "JSON Editor" for index configuration
  2. Set Index Name: customer_search_index (or your preferred name)
  3. Select Database: Your database name (default: dedup_demo)
  4. Select Collection: Your collection name (default: consumers)
  5. Copy and paste the following JSON configuration:
{
  "mappings": {
    "dynamic": false,
    "fields": {
      "Email": {
        "analyzer": "emailAnalyzer",
        "searchAnalyzer": "emailAnalyzer",
        "type": "string"
      },
      "FirstName": {
        "type": "string"
      },
      "FormattedPostalAddressDescription": {
        "type": "string"
      },
      "LastName": {
        "type": "string"
      },
      "NormalisedMobile": {
        "type": "string"
      },
      "NormalisedPhone": {
        "type": "string"
      }
    }
  },
  "analyzers": [
    {
      "charFilters": [],
      "name": "emailAnalyzer",
      "tokenFilters": [],
      "tokenizer": {
        "maxTokenLength": 200,
        "type": "uaxUrlEmail"
      }
    }
  ]
}
  1. Click "Next" and "Create Search Index"
  2. Wait for index creation (2-10 minutes depending on data size)

Option B: Using Atlas CLI

# Install Atlas CLI
curl -fLo atlas https://github.com/mongodb/mongodb-atlas-cli/releases/latest/download/atlas-linux-x86_64
chmod +x atlas

# Authenticate
atlas auth login

# Create search index
atlas clusters search indexes create \
  --clusterName <your-cluster-name> \
  --file search_index_definition.json \
  --projectId <your-project-id>

Option C: Copy from File

The complete index definition is available in search_index_definition.json. Simply copy its contents into the Atlas UI JSON editor.

Step 3: Verify Index Creation

  1. Check Index Status in Atlas Search tab - should show "Active"
  2. Test the application - duplicate detection should work immediately
  3. Monitor index size - typically 1-5% of collection size

Index Configuration Explained

  • Fields Indexed: FirstName, LastName, Email, Phone numbers, Address
  • Email Analyzer: Special tokenization for email addresses
  • Dynamic Mapping: Disabled for performance and security
  • Fuzzy Matching: Enabled through application-level fuzzy queries

Common Issues & Solutions

Issue Solution
Index creation fails Verify cluster tier (M10+ required for Atlas Search)
No search results Check collection name matches in app and index
Slow performance Ensure index status is "Active", not "Building"
Missing fields Verify field names match exactly (case-sensitive)

🌐 Access at: http://localhost:8081

πŸ—οΈ Architecture

Enterprise Structure

src/
β”œβ”€β”€ app/           # Application factory & configuration
β”œβ”€β”€ models/        # Customer & duplicate detection data models  
β”œβ”€β”€ services/      # Core business logic (search, detection, scoring)
β”œβ”€β”€ api/           # REST API endpoints
β”œβ”€β”€ web/           # Web interface routes
└── utils/         # Utilities (logging, validation, exceptions)

Key Components

  • Atlas Search Integration - Compound fuzzy queries with boost scoring
  • Similarity Engine - 160-point weighted algorithm for business confidence
  • Concurrent Processing - Parallel execution for performance optimization
  • Session Management - User settings persistence across searches
  • Field Score Configuration - Customizable weights and fuzzy match settings

πŸ” How It Works

Atlas Search Pipeline

{
  "$search": {
    "compound": {
      "should": [
        {
          "text": {
            "query": "John Smith",
            "path": "FirstName", 
            "fuzzy": {"maxEdits": 2},
            "score": {"boost": {"value": 3}}
          }
        }
        // Additional fields: LastName, Email, Phone, Address
      ]
    },
    "concurrent": true  // Configurable via settings
  }
}

Similarity Scoring

  • Names: 40 points each (exact), 20 points (partial)
  • Email: 60 points (exact), 30 points (username match)
  • Phone: 20 points (normalized digits match)
  • Address: Variable scoring based on text similarity

Confidence Levels:

  • πŸ”΄ High (>70%): Immediate merge candidates
  • 🟑 Medium (40-70%): Manual review recommended
  • 🟒 Low (<40%): Investigation needed

βš™οΈ Configuration

Access Settings page in the web interface to configure:

  • Field Scores - Boost weights and fuzzy match tolerance per field
  • Similarity Thresholds - Business confidence level boundaries
  • Search Options - Concurrent execution, result limits
  • Demo Mode - Enhanced presentation features with metrics

πŸ“Š Demo Mode Features

Perfect for customer presentations and training:

  • Real-time Performance Metrics - Search times, records examined
  • Step-by-step Process Visualization - Atlas Search execution breakdown
  • Business Impact Statistics - ROI calculations, cost savings estimates
  • Interactive Configuration - Live threshold adjustments
  • Technical Deep-dives - Expandable sections with implementation details

Enable Demo Mode: Use navigation toggle or Settings page

πŸ”Œ API Integration

Search for Duplicates

POST /api/search
Content-Type: application/json

{
  "first_name": "John",
  "last_name": "Smith",
  "email": "john.smith@example.com",
  "phone": "+1-555-123-4567"
}

Response

{
  "duplicates": [
    {
      "_id": "64a7b8c9d1e2f3a4b5c6d7e8",
      "FirstName": "Jon",
      "LastName": "Smith",
      "Email": "john.smith@example.com", 
      "similarity_score": 140,
      "search_score": 8.2,
      "confidence_level": {
        "level": "High Confidence",
        "class": "high"
      }
    }
  ],
  "count": 1
}

πŸ› οΈ Advanced Usage

Batch Processing

python batch_deduplication.py --threshold 70 --output results.json

Performance Testing

python search_query_example.py --customers 1000 --benchmark

Run Tests

python run_tests.py

🚦 Performance Benchmarks

  • Query Latency: 50-100ms (warm queries)
  • Concurrent Users: 100+ simultaneous searches
  • Accuracy: 95%+ true positive rate, <5% false positives
  • Throughput: 1000+ duplicate checks/second

🎯 Use Cases

Customer Support

  • Instant customer identification during calls
  • Consolidate duplicate support cases
  • Reduce average handle time by 15-30%

CRM Data Quality

  • Prevent duplicate lead creation
  • Maintain clean customer databases
  • Single customer view for marketing compliance

Financial Services

  • KYC compliance duplicate detection
  • Cross-product account reconciliation
  • Fraud prevention duplicate applications

πŸ”’ Production Considerations

  • Security: Environment-based secrets management, input validation
  • Monitoring: Structured logging, health checks, performance metrics
  • Scalability: Connection pooling, configurable concurrency, caching
  • Infrastructure: CDK TypeScript templates for Atlas Search deployment

πŸ“š Key Files

  • start_app.sh - Main application launcher
  • refactored_app.py - Flask application entry point
  • search_index_definition.json - Atlas Search index configuration
  • src/ - Modular enterprise architecture
  • templates/ - Modern web interface
  • docs/ - Comprehensive documentation

🀝 Support

For enterprise deployment, performance optimization, or custom development, consult your MongoDB Solutions Architect.


Powered by MongoDB Atlas Search - Intelligent duplicate detection at enterprise scale.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors