Enterprise-grade intelligent duplicate detection using MongoDB Atlas Search with advanced fuzzy matching, similarity scoring, and modern web interface.
- Advanced Duplicate Detection - Multi-field fuzzy search with configurable similarity thresholds
- Real-time Performance - Sub-100ms Atlas Search queries with concurrent execution
- Smart Confidence Scoring - 160-point weighted algorithm with business confidence levels
- Modern Web Interface - Professional customer support dashboard with merge workflows
- Demo Mode - Interactive presentation features with performance metrics and ROI calculations
- Configurable Settings - Field scores, fuzzy matching, concurrent search options
- Enterprise Ready - Structured logging, health checks, comprehensive error handling
- REST API - Full API with OpenAPI documentation for CRM integration
- Python 3.9+
- MongoDB Atlas cluster with Atlas Search enabled
- Git
# 1. Clone and setup
git clone <repository-url>
cd atlas-search-deduplication-demo
# 2. Configure environment
cp env.example .env
# Edit .env with your MongoDB Atlas connection string
# 3. Install dependencies
pip install -r requirements.txt
# 4. Verify setup
python3 -c "from dotenv import load_dotenv; load_dotenv(); import os; print('Environment OK' if os.getenv('MONGODB_URI') else 'MONGODB_URI missing')"
# 5. Generate sample data
python3 data_generator.py --num-records 10000 --duplicate-pct 0.15
# 6. Create Atlas Search index (see detailed instructions below)
# 7. Start application
./start_app.sh

The application includes a sophisticated data generator that creates realistic customer records with controlled duplicates:
# Basic data generation (10K records, 15% duplicates)
python3 data_generator.py --num-records 10000 --duplicate-pct 0.15
# Large dataset for performance testing (100K records)
python3 data_generator.py --num-records 100000 --duplicate-pct 0.2
# Small test dataset (1K records, 20% duplicates)
python3 data_generator.py --num-records 1000 --duplicate-pct 0.2
# Replace existing data without prompts
python3 data_generator.py --num-records 5000 --duplicate-pct 0.15 --force
# Add to existing data
python3 data_generator.py --num-records 2000 --duplicate-pct 0.1 --append

| Option | Description | Example |
|---|---|---|
| `--num-records` | Total number of customer records | `--num-records 50000` |
| `--duplicate-pct` | Percentage of duplicates (0.1 = 10%) | `--duplicate-pct 0.25` |
| `--drop` | Drop existing collection without prompt | `--drop` |
| `--append` | Add to existing collection | `--append` |
| `--force` | Skip all confirmations | `--force` |
| `--quiet` | Reduce logging output | `--quiet` |

- Realistic Names - US/international names with common variations
- Email Patterns - Multiple domains with realistic usernames
- Phone Numbers - Formatted and normalized versions
- Addresses - Complete postal addresses with variations
- Controlled Duplicates - Realistic typos and variations:
  - Name variants: "Jon"/"John", "Catherine"/"Katherine"
  - Email typos: "gmail.com"/"gmial.com"
  - Phone formatting: "+1-555-123-4567" vs "(555) 123-4567"
  - Address variations: "Street"/"St", "Avenue"/"Ave"
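
To make the controlled-duplicate idea concrete, here is a minimal sketch of how one variation might be applied to a record. It is not the code in `data_generator.py`: the function `make_duplicate` and the three lookup tables are hypothetical, and the tables contain only the example pairs listed above. Field names follow the Atlas Search index definition in this README.

```python
import random

# Hypothetical variant tables, seeded with the example pairs from the list above.
NAME_VARIANTS = {"John": "Jon", "Catherine": "Katherine"}
DOMAIN_TYPOS = {"gmail.com": "gmial.com"}
ADDRESS_ABBREVIATIONS = {"Street": "St", "Avenue": "Ave"}


def make_duplicate(record: dict, rng: random.Random) -> dict:
    """Return a copy of `record` with one randomly chosen field perturbed."""
    dup = dict(record)  # shallow copy so the original record is untouched
    field = rng.choice(["FirstName", "Email", "FormattedPostalAddressDescription"])
    if field == "FirstName":
        # Swap the name for a known variant, if one exists.
        dup[field] = NAME_VARIANTS.get(dup[field], dup[field])
    elif field == "Email":
        # Introduce a typo in the domain part only.
        user, _, domain = dup[field].partition("@")
        dup[field] = f"{user}@{DOMAIN_TYPOS.get(domain, domain)}"
    else:
        # Abbreviate street-type words in the address.
        for full, short in ADDRESS_ABBREVIATIONS.items():
            dup[field] = dup[field].replace(full, short)
    return dup
```

Passing a seeded `random.Random` keeps the generated variations reproducible between runs, which helps when comparing detection accuracy across configuration changes.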
- Log in to MongoDB Atlas at cloud.mongodb.com
- Select your cluster
- Navigate to the Search tab in the cluster view
- Click "Create Index"
- Choose "JSON Editor" for index configuration
- Set Index Name: `customer_search_index` (or your preferred name)
- Select Database: Your database name (default: `dedup_demo`)
- Select Collection: Your collection name (default: `consumers`)
- Copy and paste the following JSON configuration:
{
"mappings": {
"dynamic": false,
"fields": {
"Email": {
"analyzer": "emailAnalyzer",
"searchAnalyzer": "emailAnalyzer",
"type": "string"
},
"FirstName": {
"type": "string"
},
"FormattedPostalAddressDescription": {
"type": "string"
},
"LastName": {
"type": "string"
},
"NormalisedMobile": {
"type": "string"
},
"NormalisedPhone": {
"type": "string"
}
}
},
"analyzers": [
{
"charFilters": [],
"name": "emailAnalyzer",
"tokenFilters": [],
"tokenizer": {
"maxTokenLength": 200,
"type": "uaxUrlEmail"
}
}
]
}

- Click "Next" and "Create Search Index"
- Wait for index creation (2-10 minutes depending on data size)
# Install Atlas CLI
curl -fLo atlas https://github.com/mongodb/mongodb-atlas-cli/releases/latest/download/atlas-linux-x86_64
chmod +x atlas
# Authenticate
atlas auth login
# Create search index
atlas clusters search indexes create \
--clusterName <your-cluster-name> \
--file search_index_definition.json \
--projectId <your-project-id>

The complete index definition is available in `search_index_definition.json`. Simply copy its contents into the Atlas UI JSON editor.
- Check Index Status in Atlas Search tab - should show "Active"
- Test the application - duplicate detection should work immediately
- Monitor index size - typically 1-5% of collection size
- Fields Indexed: FirstName, LastName, Email, Phone numbers, Address
- Email Analyzer: Special tokenization for email addresses
- Dynamic Mapping: Disabled for performance and security
- Fuzzy Matching: Enabled through application-level fuzzy queries
| Issue | Solution |
|---|---|
| Index creation fails | Verify the cluster tier and its index limits (free and shared tiers cap the number of Atlas Search indexes) |
| No search results | Check collection name matches in app and index |
| Slow performance | Ensure index status is "Active", not "Building" |
| Missing fields | Verify field names match exactly (case-sensitive) |
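
Because the index uses `"dynamic": false`, only the fields named in the definition are searchable, and a field stored with different casing is not indexed. The sketch below checks a sample document against the indexed field names. The helper `check_field_names` is hypothetical and not part of the application; the field names are copied from the index definition in this README.

```python
# Field names copied from the Atlas Search index definition in this README.
INDEXED_FIELDS = {
    "Email",
    "FirstName",
    "LastName",
    "FormattedPostalAddressDescription",
    "NormalisedMobile",
    "NormalisedPhone",
}


def check_field_names(document: dict) -> dict:
    """Report indexed fields that are absent, and near-misses that differ only by case."""
    lowered = {name.lower(): name for name in document}
    report = {"missing": [], "case_mismatch": []}
    for field in sorted(INDEXED_FIELDS):
        if field in document:
            continue  # exact, case-sensitive match: this field will be indexed
        actual = lowered.get(field.lower())
        if actual is not None:
            # Present, but under a differently cased name that the index ignores.
            report["case_mismatch"].append((field, actual))
        else:
            report["missing"].append(field)
    return report
```

Running this on one document fetched from the collection is a quick way to tell a naming problem apart from an index that is still building.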
Access at: http://localhost:8081
src/
├── app/ # Application factory & configuration
├── models/ # Customer & duplicate detection data models
├── services/ # Core business logic (search, detection, scoring)
├── api/ # REST API endpoints
├── web/ # Web interface routes
└── utils/ # Utilities (logging, validation, exceptions)
- Atlas Search Integration - Compound fuzzy queries with boost scoring
- Similarity Engine - 160-point weighted algorithm for business confidence
- Concurrent Processing - Parallel execution for performance optimization
- Session Management - User settings persistence across searches
- Field Score Configuration - Customizable weights and fuzzy match settings
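
As an illustration of the compound fuzzy query with boost scoring, the following sketch assembles a `$search` stage from per-field settings. The function `build_search_stage` and the `FIELD_SETTINGS` table are hypothetical. Only the `FirstName` boost of 3 and `maxEdits` of 2 come from this README; the other values are placeholders. The default index name is the one suggested in the setup steps.

```python
# Hypothetical per-field settings; only the FirstName values come from this README.
FIELD_SETTINGS = {
    "FirstName": {"boost": 3, "max_edits": 2},
    "LastName": {"boost": 3, "max_edits": 2},
    "Email": {"boost": 5, "max_edits": 1},
}


def build_search_stage(
    customer: dict,
    field_settings: dict = FIELD_SETTINGS,
    index_name: str = "customer_search_index",
    concurrent: bool = True,
) -> dict:
    """Build a $search stage with one fuzzy `text` clause per populated field."""
    should = []
    for path, settings in field_settings.items():
        value = customer.get(path)
        if not value:
            continue  # skip fields the caller did not provide
        should.append(
            {
                "text": {
                    "query": value,
                    "path": path,
                    "fuzzy": {"maxEdits": settings["max_edits"]},
                    "score": {"boost": {"value": settings["boost"]}},
                }
            }
        )
    if not should:
        raise ValueError("at least one searchable field is required")
    return {
        "$search": {
            "index": index_name,
            "compound": {"should": should},
            "concurrent": concurrent,
        }
    }
```

With PyMongo, the resulting stage would typically be passed as the first element of an aggregation pipeline, for example `collection.aggregate([stage, {"$limit": 10}])`.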
{
"$search": {
"compound": {
"should": [
{
"text": {
"query": "John Smith",
"path": "FirstName",
"fuzzy": {"maxEdits": 2},
"score": {"boost": {"value": 3}}
}
}
// Additional fields: LastName, Email, Phone, Address
]
},
"concurrent": true // Configurable via settings
}
}

- Names: 40 points each (exact), 20 points (partial)
- Email: 60 points (exact), 30 points (username match)
- Phone: 20 points (normalized digits match)
- Address: Variable scoring based on text similarity
Confidence Levels:
- High (>70%): Immediate merge candidates
- Medium (40-70%): Manual review recommended
- Low (<40%): Investigation needed
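
The point values and confidence bands above can be sketched as follows. This is an illustration under stated assumptions, not the application's scoring code. A "partial" name match is assumed to mean a `difflib` similarity ratio of at least 0.8, phone numbers are assumed to be compared on their last 10 digits, and address scoring is left out because the README only describes it as variable. All function names are hypothetical.

```python
import re
from difflib import SequenceMatcher

MAX_SCORE = 160  # the 160-point scale named in this README


def normalize_phone(phone: str) -> str:
    """Strip non-digits and keep the last 10 digits (assumed: drops a country code)."""
    return re.sub(r"\D", "", phone or "")[-10:]


def name_points(a: str, b: str) -> int:
    a, b = a.strip().lower(), b.strip().lower()
    if not a or not b:
        return 0
    if a == b:
        return 40
    # Assumption: "partial" means a similarity ratio of at least 0.8.
    return 20 if SequenceMatcher(None, a, b).ratio() >= 0.8 else 0


def email_points(a: str, b: str) -> int:
    a, b = a.strip().lower(), b.strip().lower()
    if not a or not b:
        return 0
    if a == b:
        return 60
    # Same username with a different domain counts as a username match.
    return 30 if a.split("@")[0] == b.split("@")[0] else 0


def similarity_score(query: dict, candidate: dict) -> int:
    """Sum name, email, and phone points; address scoring is omitted here."""
    score = name_points(query.get("FirstName", ""), candidate.get("FirstName", ""))
    score += name_points(query.get("LastName", ""), candidate.get("LastName", ""))
    score += email_points(query.get("Email", ""), candidate.get("Email", ""))
    q_phone = normalize_phone(query.get("NormalisedPhone", ""))
    c_phone = normalize_phone(candidate.get("NormalisedPhone", ""))
    if q_phone and q_phone == c_phone:
        score += 20
    return score


def confidence_level(score: int) -> str:
    """Map a score to the High / Medium / Low bands listed above."""
    pct = 100 * score / MAX_SCORE
    if pct > 70:
        return "high"
    if pct >= 40:
        return "medium"
    return "low"
```

Under these assumptions, "John Smith" against "Jon Smith" with the same email and phone number scores 20 + 40 + 60 + 20 = 140, which is 87.5% of 160 and lands in the High band.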
Open the Settings page in the web interface to configure:
- Field Scores - Boost weights and fuzzy match tolerance per field
- Similarity Thresholds - Business confidence level boundaries
- Search Options - Concurrent execution, result limits
- Demo Mode - Enhanced presentation features with metrics
Perfect for customer presentations and training:
- Real-time Performance Metrics - Search times, records examined
- Step-by-step Process Visualization - Atlas Search execution breakdown
- Business Impact Statistics - ROI calculations, cost savings estimates
- Interactive Configuration - Live threshold adjustments
- Technical Deep-dives - Expandable sections with implementation details
Enable Demo Mode: Use the navigation toggle or the Settings page
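
For CRM integration, the `POST /api/search` endpoint documented in this README can be called with the Python standard library alone. The sketch below is a minimal client, assuming the application runs at `http://localhost:8081` and returns a JSON body with a `duplicates` list. The function names `build_search_request` and `find_duplicates` are hypothetical.

```python
import json
from urllib import request


def build_search_request(
    customer: dict, base_url: str = "http://localhost:8081"
) -> request.Request:
    """Prepare (but do not send) a POST /api/search request with a JSON body."""
    body = json.dumps(customer).encode("utf-8")
    return request.Request(
        f"{base_url}/api/search",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def find_duplicates(
    customer: dict, base_url: str = "http://localhost:8081", timeout: float = 5.0
) -> list:
    """Send the search request and return the `duplicates` list from the response."""
    req = build_search_request(customer, base_url)
    with request.urlopen(req, timeout=timeout) as resp:
        payload = json.load(resp)
    return payload.get("duplicates", [])
```

Keeping request construction separate from sending makes the payload easy to inspect or log before any network call is made.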
POST /api/search
Content-Type: application/json
{
"first_name": "John",
"last_name": "Smith",
"email": "john.smith@example.com",
"phone": "+1-555-123-4567"
}

Response:
{
"duplicates": [
{
"_id": "64a7b8c9d1e2f3a4b5c6d7e8",
"FirstName": "Jon",
"LastName": "Smith",
"Email": "john.smith@example.com",
"similarity_score": 140,
"search_score": 8.2,
"confidence_level": {
"level": "High Confidence",
"class": "high"
}
}
],
"count": 1
}

python batch_deduplication.py --threshold 70 --output results.json
python search_query_example.py --customers 1000 --benchmark
python run_tests.py

- Query Latency: 50-100ms (warm queries)
- Concurrent Users: 100+ simultaneous searches
- Accuracy: 95%+ true positive rate, <5% false positives
- Throughput: 1000+ duplicate checks/second
- Instant customer identification during calls
- Consolidate duplicate support cases
- Reduce average handle time by 15-30%
- Prevent duplicate lead creation
- Maintain clean customer databases
- Single customer view for marketing compliance
- KYC compliance duplicate detection
- Cross-product account reconciliation
- Fraud prevention for duplicate applications
- Security: Environment-based secrets management, input validation
- Monitoring: Structured logging, health checks, performance metrics
- Scalability: Connection pooling, configurable concurrency, caching
- Infrastructure: CDK TypeScript templates for Atlas Search deployment
- `start_app.sh` - Main application launcher
- `refactored_app.py` - Flask application entry point
- `search_index_definition.json` - Atlas Search index configuration
- `src/` - Modular enterprise architecture
- `templates/` - Modern web interface
- `docs/` - Comprehensive documentation
For enterprise deployment, performance optimization, or custom development, consult your MongoDB Solutions Architect.
Powered by MongoDB Atlas Search - Intelligent duplicate detection at enterprise scale.