MongoDB Atlas Search Customer Deduplication Platform

🚀 Enterprise-grade intelligent duplicate detection using MongoDB Atlas Search with advanced fuzzy matching, similarity scoring, and a modern web interface.

✨ Features

🔍 Advanced Duplicate Detection - Multi-field fuzzy search with configurable similarity thresholds
⚡ Real-time Performance - Sub-100ms Atlas Search queries with concurrent execution
🎯 Smart Confidence Scoring - 160-point weighted algorithm with business confidence levels
🖥️ Modern Web Interface - Professional customer support dashboard with merge workflows
🎬 Demo Mode - Interactive presentation features with performance metrics and ROI calculations
🔧 Configurable Settings - Field scores, fuzzy matching, concurrent search options
🛡️ Enterprise Ready - Structured logging, health checks, comprehensive error handling
📊 REST API - Full API with OpenAPI documentation for CRM integration

🚀 Quick Start

Prerequisites

Python 3.9+
MongoDB Atlas cluster with Atlas Search enabled
Git

Installation & Setup

# 1. Clone and set up
git clone <repository-url>
cd atlas-search-deduplication-demo

# 2. Configure environment
cp env.example .env
# Edit .env with your MongoDB Atlas connection string

# 3. Install dependencies
pip install -r requirements.txt

# 4. Verify setup
python3 -c "from dotenv import load_dotenv; load_dotenv(); import os; print('✅ Environment OK' if os.getenv('MONGODB_URI') else '❌ MONGODB_URI missing')"

# 5. Generate sample data
python3 data_generator.py --num-records 10000 --duplicate-pct 0.15

# 6. Create Atlas Search index (see detailed instructions below)

# 7. Start application
./start_app.sh

📊 Data Generation

Generate Sample Customer Data

The application includes a sophisticated data generator that creates realistic customer records with controlled duplicates:

# Basic data generation (10K records, 15% duplicates)
python3 data_generator.py --num-records 10000 --duplicate-pct 0.15

# Large dataset for performance testing (100K records)
python3 data_generator.py --num-records 100000 --duplicate-pct 0.2

# Small test dataset (1K records, 20% duplicates)
python3 data_generator.py --num-records 1000 --duplicate-pct 0.2

# Replace existing data without prompts
python3 data_generator.py --num-records 5000 --duplicate-pct 0.15 --force

# Add to existing data
python3 data_generator.py --num-records 2000 --duplicate-pct 0.1 --append

Data Generation Options

Option	Description	Example
`--num-records`	Total number of customer records	`--num-records 50000`
`--duplicate-pct`	Fraction of records that are duplicates, given as a decimal (0.1 = 10%)	`--duplicate-pct 0.25`
`--drop`	Drop existing collection without prompt	`--drop`
`--append`	Add to existing collection	`--append`
`--force`	Skip all confirmations	`--force`
`--quiet`	Reduce logging output	`--quiet`

Generated Data Features

Realistic Names - US/international names with common variations
Email Patterns - Multiple domains with realistic usernames
Phone Numbers - Formatted and normalized versions
Addresses - Complete postal addresses with variations
Controlled Duplicates - Realistic typos and variations:
- Name variants: "Jon"/"John", "Catherine"/"Katherine"
- Email typos: "gmail.com"/"gmial.com"
- Phone formatting: "+1-555-123-4567" vs "(555) 123-4567"
- Address variations: "Street"/"St", "Avenue"/"Ave"
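The variations listed above can be sketched in a few lines of Python. This is a minimal illustration, not the code in data_generator.py: the lookup tables are seeded only from the examples in this README, and the function names and the `Phone` field name are hypothetical.

```python
import random
import re

# Hypothetical lookup tables, seeded from the examples in this README.
NAME_VARIANTS = {"John": "Jon", "Catherine": "Katherine"}
ADDRESS_ABBREVIATIONS = {"Street": "St", "Avenue": "Ave"}


def vary_name(name: str) -> str:
    """Swap a name for a known variant, or return it unchanged."""
    return NAME_VARIANTS.get(name, name)


def typo_email_domain(email: str) -> str:
    """Transpose the 3rd and 4th letters of the domain (gmail -> gmial)."""
    user, _, domain = email.partition("@")
    if len(domain) < 4:
        return email
    chars = list(domain)
    chars[2], chars[3] = chars[3], chars[2]
    return f"{user}@{''.join(chars)}"


def reformat_phone(phone: str) -> str:
    """Render the last 10 digits in the (555) 123-4567 style."""
    digits = re.sub(r"\D", "", phone)[-10:]
    if len(digits) != 10:
        return phone
    return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"


def abbreviate_address(address: str) -> str:
    """Replace whole words such as "Street" with their abbreviation."""
    for full, short in ADDRESS_ABBREVIATIONS.items():
        address = re.sub(rf"\b{full}\b", short, address)
    return address


# Field names other than "Phone" come from the index definition in this README;
# "Phone" is a hypothetical name for the formatted phone field.
MUTATORS = {
    "FirstName": vary_name,
    "Email": typo_email_domain,
    "Phone": reformat_phone,
    "FormattedPostalAddressDescription": abbreviate_address,
}


def make_duplicate(record: dict, rng: random.Random) -> dict:
    """Copy a record and apply one randomly chosen variation to it."""
    duplicate = dict(record)
    field = rng.choice(sorted(MUTATORS))
    duplicate[field] = MUTATORS[field](duplicate[field])
    return duplicate
```

Passing a seeded `random.Random` instance keeps generated datasets reproducible between runs, which helps when comparing detection accuracy across configuration changes.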

🔍 Atlas Search Index Setup

Step 1: Access Atlas Search

Log in to MongoDB Atlas at cloud.mongodb.com
Select your cluster
Navigate to the Search tab in the cluster view
Click "Create Index"

Step 2: Create Search Index

Option A: Using Atlas UI (Recommended)

Choose "JSON Editor" for index configuration
Set Index Name: customer_search_index (or your preferred name)
Select Database: Your database name (default: dedup_demo)
Select Collection: Your collection name (default: consumers)
Copy and paste the following JSON configuration:

{
  "mappings": {
    "dynamic": false,
    "fields": {
      "Email": {
        "analyzer": "emailAnalyzer",
        "searchAnalyzer": "emailAnalyzer",
        "type": "string"
      },
      "FirstName": {
        "type": "string"
      },
      "FormattedPostalAddressDescription": {
        "type": "string"
      },
      "LastName": {
        "type": "string"
      },
      "NormalisedMobile": {
        "type": "string"
      },
      "NormalisedPhone": {
        "type": "string"
      }
    }
  },
  "analyzers": [
    {
      "charFilters": [],
      "name": "emailAnalyzer",
      "tokenFilters": [],
      "tokenizer": {
        "maxTokenLength": 200,
        "type": "uaxUrlEmail"
      }
    }
  ]
}

Click "Next" and "Create Search Index"
Wait for index creation (2-10 minutes depending on data size)
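Before pasting a definition into the JSON editor, it can help to check that every custom analyzer a field refers to is actually declared, since a misspelled analyzer name is an easy mistake to make. The helper below is a small sketch of such a check; the function name is hypothetical, and it assumes that analyzer names starting with `lucene.` are built-in.

```python
import json


def undefined_analyzers(index_definition: dict) -> set:
    """Return analyzer names that fields use but "analyzers" does not declare."""
    declared = {a["name"] for a in index_definition.get("analyzers", [])}
    used = set()
    fields = index_definition.get("mappings", {}).get("fields", {})
    for mapping in fields.values():
        # A field may map to a single definition or to a list of definitions.
        definitions = mapping if isinstance(mapping, list) else [mapping]
        for definition in definitions:
            for key in ("analyzer", "searchAnalyzer"):
                name = definition.get(key)
                # Assumption: names prefixed with "lucene." are built-in analyzers.
                if name and not name.startswith("lucene."):
                    used.add(name)
    return used - declared


if __name__ == "__main__":
    # Abbreviated copy of the definition shown above.
    definition = json.loads("""
    {
      "mappings": {
        "dynamic": false,
        "fields": {
          "Email": {
            "analyzer": "emailAnalyzer",
            "searchAnalyzer": "emailAnalyzer",
            "type": "string"
          },
          "FirstName": {"type": "string"}
        }
      },
      "analyzers": [
        {
          "charFilters": [],
          "name": "emailAnalyzer",
          "tokenFilters": [],
          "tokenizer": {"maxTokenLength": 200, "type": "uaxUrlEmail"}
        }
      ]
    }
    """)
    missing = undefined_analyzers(definition)
    print("Missing analyzers:", sorted(missing) if missing else "none")
```

To check the file shipped with the repository, load search_index_definition.json with `json.load` and pass the result to the same function.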

Option B: Using Atlas CLI

# Install Atlas CLI
curl -fLo atlas https://github.com/mongodb/mongodb-atlas-cli/releases/latest/download/atlas-linux-x86_64
chmod +x atlas

# Authenticate
atlas auth login

# Create search index
atlas clusters search indexes create \
  --clusterName <your-cluster-name> \
  --file search_index_definition.json \
  --projectId <your-project-id>

Option C: Copy from File

The complete index definition is available in search_index_definition.json. Simply copy its contents into the Atlas UI JSON editor.

Step 3: Verify Index Creation

Check Index Status in Atlas Search tab - should show "Active"
Test the application - duplicate detection should work immediately
Monitor index size - typically 1-5% of collection size

Index Configuration Explained

Fields Indexed: FirstName, LastName, Email, NormalisedPhone, NormalisedMobile, FormattedPostalAddressDescription
Email Analyzer: Special tokenization for email addresses
Dynamic Mapping: Disabled for performance and security
Fuzzy Matching: Enabled through application-level fuzzy queries

Common Issues & Solutions

Issue	Solution
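Index creation fails	Check the search index limit for your cluster tier (Atlas Search is available on all tiers, including free M0 clusters, but free and shared tiers cap the number of search indexes)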
No search results	Check collection name matches in app and index
Slow performance	Ensure index status is "Active", not "Building"
Missing fields	Verify field names match exactly (case-sensitive)

🌐 Once the application is running (./start_app.sh), access it at: http://localhost:8081

🏗️ Architecture

Enterprise Structure

src/
├── app/           # Application factory & configuration
├── models/        # Customer & duplicate detection data models  
├── services/      # Core business logic (search, detection, scoring)
├── api/           # REST API endpoints
├── web/           # Web interface routes
└── utils/         # Utilities (logging, validation, exceptions)

Key Components

Atlas Search Integration - Compound fuzzy queries with boost scoring
Similarity Engine - 160-point weighted algorithm for business confidence
Concurrent Processing - Parallel execution for performance optimization
Session Management - User settings persistence across searches
Field Score Configuration - Customizable weights and fuzzy match settings

🔍 How It Works

Atlas Search Pipeline

{
  "$search": {
    "compound": {
      "should": [
        {
          "text": {
            "query": "John",
            "path": "FirstName",
            "fuzzy": {"maxEdits": 2},
            "score": {"boost": {"value": 3}}
          }
        }
        // Additional fields: LastName, Email, Phone, Address
      ]
    },
    "concurrent": true  // Configurable via settings
  }
}
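A stage like the one above can be assembled programmatically from whichever fields the caller supplies. The following is a sketch under stated assumptions rather than the application's own query builder: `build_search_stage` and `FIELD_CONFIG` are hypothetical names, and only the FirstName boost of 3 comes from the example above; the other boosts are placeholders.

```python
# Hypothetical mapping from request keys to (indexed field, boost value).
# Only the FirstName boost of 3 appears in this README; the rest are placeholders.
FIELD_CONFIG = {
    "first_name": ("FirstName", 3),
    "last_name": ("LastName", 3),
    "email": ("Email", 5),
    "phone": ("NormalisedPhone", 2),
}


def build_search_stage(
    customer: dict,
    index: str = "customer_search_index",
    max_edits: int = 2,
    concurrent: bool = False,
) -> dict:
    """Build a compound $search stage with one fuzzy text clause per field."""
    if max_edits not in (1, 2):
        # Atlas Search accepts only 1 or 2 for fuzzy.maxEdits.
        raise ValueError("max_edits must be 1 or 2")

    should = []
    for key, (path, boost) in FIELD_CONFIG.items():
        value = customer.get(key)
        if not value:
            continue  # skip fields the caller did not provide
        should.append(
            {
                "text": {
                    "query": value,
                    "path": path,
                    "fuzzy": {"maxEdits": max_edits},
                    "score": {"boost": {"value": boost}},
                }
            }
        )
    if not should:
        raise ValueError("at least one searchable field is required")

    stage = {"index": index, "compound": {"should": should}}
    if concurrent:
        stage["concurrent"] = True
    return {"$search": stage}
```

With PyMongo, the returned stage would be the first element of the list passed to `collection.aggregate(...)`, typically followed by a `$limit` stage to cap the number of candidates.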

Similarity Scoring

Names: 40 points each (exact), 20 points (partial)
Email: 60 points (exact), 30 points (username match)
Phone: 20 points (normalized digits match)
Address: Variable scoring based on text similarity

Confidence Levels:

🔴 High (>70%): Immediate merge candidates
🟡 Medium (40-70%): Manual review recommended
🟢 Low (<40%): Investigation needed
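The point values and confidence bands above can be expressed as a short scoring function. This is a sketch, not the project's scoring service, and it makes several assumptions: a "partial" name match is taken to mean a `difflib` similarity ratio of at least 0.8, phone numbers are compared on their last 10 digits, percentages are computed against the 160-point maximum, and address scoring is left out because its weighting is not specified here.

```python
import re
from difflib import SequenceMatcher

MAX_SCORE = 160  # 40 + 40 + 60 + 20; address scoring is omitted in this sketch


def _norm(value) -> str:
    """Lower-case and trim a value, treating None as an empty string."""
    return (value or "").strip().lower()


def name_points(a, b, partial_ratio: float = 0.8) -> int:
    """40 points for an exact match, 20 for a close match, otherwise 0."""
    a, b = _norm(a), _norm(b)
    if not a or not b:
        return 0
    if a == b:
        return 40
    # Assumption: "partial" means a high character-level similarity ratio.
    return 20 if SequenceMatcher(None, a, b).ratio() >= partial_ratio else 0


def email_points(a, b) -> int:
    """60 points for an exact match, 30 when only the username matches."""
    a, b = _norm(a), _norm(b)
    if not a or not b:
        return 0
    if a == b:
        return 60
    return 30 if a.split("@")[0] == b.split("@")[0] else 0


def phone_points(a, b) -> int:
    """20 points when the last 10 digits match (assumed normalisation)."""
    digits_a = re.sub(r"\D", "", a or "")[-10:]
    digits_b = re.sub(r"\D", "", b or "")[-10:]
    return 20 if digits_a and digits_a == digits_b else 0


def similarity_score(query: dict, candidate: dict) -> int:
    """Sum the per-field points for a search request and a stored record."""
    return (
        name_points(query.get("first_name"), candidate.get("FirstName"))
        + name_points(query.get("last_name"), candidate.get("LastName"))
        + email_points(query.get("email"), candidate.get("Email"))
        + phone_points(query.get("phone"), candidate.get("NormalisedPhone"))
    )


def confidence_level(score: int) -> str:
    """Map a score to the High / Medium / Low bands listed above."""
    percent = 100 * score / MAX_SCORE
    if percent > 70:
        return "high"
    if percent >= 40:
        return "medium"
    return "low"


if __name__ == "__main__":
    query = {
        "first_name": "John",
        "last_name": "Smith",
        "email": "john.smith@example.com",
        "phone": "+1-555-123-4567",
    }
    candidate = {
        "FirstName": "Jon",
        "LastName": "Smith",
        "Email": "john.smith@example.com",
        "NormalisedPhone": "5551234567",
    }
    score = similarity_score(query, candidate)
    print(score, confidence_level(score))  # 140 high
```

Under these assumptions, "Jon" against "John" earns the 20-point partial score, so the pair totals 20 + 40 + 60 + 20 = 140 points, or 87.5% of the maximum.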

⚙️ Configuration

Open the Settings page in the web interface to configure:

Field Scores - Boost weights and fuzzy match tolerance per field
Similarity Thresholds - Business confidence level boundaries
Search Options - Concurrent execution, result limits
Demo Mode - Enhanced presentation features with metrics

📊 Demo Mode Features

Perfect for customer presentations and training:

Real-time Performance Metrics - Search times, records examined
Step-by-step Process Visualization - Atlas Search execution breakdown
Business Impact Statistics - ROI calculations, cost savings estimates
Interactive Configuration - Live threshold adjustments
Technical Deep-dives - Expandable sections with implementation details

Enable Demo Mode: Use the navigation toggle or the Settings page

🔌 API Integration

Search for Duplicates

POST /api/search
Content-Type: application/json

{
  "first_name": "John",
  "last_name": "Smith",
  "email": "john.smith@example.com",
  "phone": "+1-555-123-4567"
}

Response

{
  "duplicates": [
    {
      "_id": "64a7b8c9d1e2f3a4b5c6d7e8",
      "FirstName": "Jon",
      "LastName": "Smith",
      "Email": "john.smith@example.com", 
      "similarity_score": 140,
      "search_score": 8.2,
      "confidence_level": {
        "level": "High Confidence",
        "class": "high"
      }
    }
  ],
  "count": 1
}
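For CRM integration, the request and response above can be wrapped in a small client that uses only the Python standard library. This is a sketch: the function names are hypothetical, and the base URL assumes the application is running locally on port 8081 as described in this README.

```python
import json
import urllib.request


def build_search_request(
    customer: dict, base_url: str = "http://localhost:8081"
) -> urllib.request.Request:
    """Prepare the POST /api/search request without sending it."""
    body = json.dumps(customer).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/search",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def find_duplicates(
    customer: dict, base_url: str = "http://localhost:8081", timeout: float = 10.0
) -> list:
    """Send the search request and return the "duplicates" list."""
    request = build_search_request(customer, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    return payload.get("duplicates", [])


if __name__ == "__main__":
    # Requires the application to be running locally.
    matches = find_duplicates(
        {
            "first_name": "John",
            "last_name": "Smith",
            "email": "john.smith@example.com",
            "phone": "+1-555-123-4567",
        }
    )
    for match in matches:
        print(match["_id"], match["similarity_score"], match["confidence_level"]["class"])
```

Separating request construction from sending keeps the payload format easy to unit test without a running server.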

🛠️ Advanced Usage

Batch Processing

python3 batch_deduplication.py --threshold 70 --output results.json

Performance Testing

python3 search_query_example.py --customers 1000 --benchmark

Run Tests

python3 run_tests.py

🚦 Performance Benchmarks

Query Latency: 50-100ms (warm queries)
Concurrent Users: 100+ simultaneous searches
Accuracy: 95%+ true positive rate, <5% false positives
Throughput: 1000+ duplicate checks/second
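Figures like these depend on cluster tier, data size, and network distance, so it is worth re-measuring them in your own environment. The helper below is a generic sketch for timing warm queries and is not the benchmark in search_query_example.py. `measure_latency_ms` is a hypothetical name, and `run_query` stands for any zero-argument callable that executes one search.

```python
import math
import statistics
import time


def measure_latency_ms(run_query, warmup: int = 5, iterations: int = 50) -> dict:
    """Time repeated calls to run_query and summarise latency in milliseconds."""
    if iterations < 1:
        raise ValueError("iterations must be at least 1")

    # Warm-up calls are executed but not timed, so caches and connections are hot.
    for _ in range(warmup):
        run_query()

    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        run_query()
        samples.append((time.perf_counter() - start) * 1000)

    samples.sort()
    # Nearest-rank 95th percentile.
    p95_index = max(0, math.ceil(0.95 * len(samples)) - 1)
    return {
        "median_ms": statistics.median(samples),
        "p95_ms": samples[p95_index],
        "max_ms": samples[-1],
    }


if __name__ == "__main__":
    # Stand-in workload; replace with a function that runs one Atlas Search query.
    print(measure_latency_ms(lambda: sum(range(10_000))))
```

Reporting the median together with a high percentile gives a better picture than a single average, because occasional slow queries would otherwise be hidden.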

🎯 Use Cases

Customer Support

Instant customer identification during calls
Consolidate duplicate support cases
Reduce average handle time by 15-30%

CRM Data Quality

Prevent duplicate lead creation
Maintain clean customer databases
Single customer view for marketing compliance

Financial Services

Duplicate detection for KYC compliance
Cross-product account reconciliation
Detection of duplicate applications for fraud prevention

🔒 Production Considerations

Security: Environment-based secrets management, input validation
Monitoring: Structured logging, health checks, performance metrics
Scalability: Connection pooling, configurable concurrency, caching
Infrastructure: CDK TypeScript templates for Atlas Search deployment

📚 Key Files

start_app.sh - Main application launcher
refactored_app.py - Flask application entry point
search_index_definition.json - Atlas Search index configuration
src/ - Modular enterprise architecture
templates/ - Modern web interface
docs/ - Comprehensive documentation

🤝 Support

For enterprise deployment, performance optimization, or custom development, consult your MongoDB Solutions Architect.

Powered by MongoDB Atlas Search - Intelligent duplicate detection at enterprise scale.
