Status: ✅ Infrastructure complete | ⏳ Baseline merge running
Last Updated: March 14, 2026
- Virtual Environment: Activate the project venv
source .venv/bin/activate
- MediaIngredientMech (for hierarchy features): Clone the repository
# Outside this project
git clone https://github.com/microbiomedata/MediaIngredientMech.git
Running now: Processing 15,431 recipes → data/merge_yaml/merged_2026/
Previous baseline (Feb 2026):
- Input: 10,595 recipes
- Output: 1,350 merged recipes
- Reduction: 87.3%
Expected new baseline:
- Input: 15,431 recipes
- Output: ~2,000 merged recipes (estimate)
- Reduction: ~87% (estimate)
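The reduction figures above follow directly from the input and output counts. A minimal sketch of that arithmetic is below; `reduction_pct` is an illustrative helper name, not a function from this project.

```python
def reduction_pct(n_input: int, n_output: int) -> float:
    """Percentage of recipes eliminated by merging."""
    if n_input <= 0:
        raise ValueError("n_input must be positive")
    return 100.0 * (1 - n_output / n_input)


# Previous baseline (Feb 2026): 10,595 recipes in, 1,350 out
print(f"{reduction_pct(10_595, 1_350):.1f}%")  # 87.3%

# Expected new baseline, using the ~2,000 estimate from above
print(f"{reduction_pct(15_431, 2_000):.1f}%")  # 87.0%
```

Once the new merge finishes, the same calculation on the actual output count shows how far the result is from the ~87% estimate.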
All core scripts and modules are ready to use:
- ✅ Hierarchy-aware fingerprinting
- ✅ Merge rule engine
- ✅ Quality validation
- ✅ Rollback capability
- ✅ Monitoring dashboard
# Requires MediaIngredientMech repository
source .venv/bin/activate
python scripts/compare_fingerprints.py \
--mim-repo /path/to/MediaIngredientMech \
--output reports/fingerprint_comparison.yaml \
--limit 5000 # Optional: test on subset first
# View results
cat reports/fingerprint_comparison.yaml
Output: Shows how many recipes would merge differently in chemical vs variant vs original modes.
source .venv/bin/activate
python scripts/test_merge_modes.py \
--mim-repo /path/to/MediaIngredientMech \
--modes conservative,aggressive,variant-aware \
--fingerprint-mode chemical \
--output reports/mode_comparison.yaml
# View results
cat reports/mode_comparison.yaml
Output: Statistics for each merge mode (group counts, reduction %, confidence scores).
source .venv/bin/activate
# Validate current baseline merge
python scripts/validate_merge_quality.py \
--merged-dir data/merge_yaml/merged_2026 \
--mim-repo /path/to/MediaIngredientMech \
--output reports/merge_quality.yaml
# View quality report
cat reports/merge_quality.yaml
Checks:
- ❌ Variant contamination (e.g., CaCl₂·2H₂O merged with CaCl₂)
- ⚠️ Parent mismatches (conflicting hierarchy relationships)
- ℹ️ Concentration outliers (wildly different amounts)
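To make the variant contamination check concrete, here is a rough sketch of how such a condition could be detected for hydration variants. It is not the logic in `validate_merge_quality.py`. The regex, the helper names, and the assumption that ingredients appear as formula strings with a trailing `·nH₂O` suffix are all illustrative.

```python
import re

# Assumed notation: a trailing hydrate suffix such as "·2H₂O" or ".7H2O".
HYDRATE_SUFFIX = re.compile(r"\s*[·.]\s*\d*\s*H[2₂]O$")


def anhydrous_form(formula: str) -> str:
    """Strip a trailing hydrate suffix, leaving the parent formula."""
    return HYDRATE_SUFFIX.sub("", formula.strip())


def has_variant_contamination(formulas: list[str]) -> bool:
    """True when a merged group mixes different variants of one parent compound."""
    distinct = {f.strip() for f in formulas}
    parents = {anhydrous_form(f) for f in distinct}
    # Fewer parents than distinct formulas means two variants share a parent.
    return len(distinct) > len(parents)


print(has_variant_contamination(["CaCl₂·2H₂O", "CaCl₂"]))  # True
print(has_variant_contamination(["CaCl₂", "NaCl"]))        # False
```

A real check would presumably consult the MediaIngredientMech hierarchy rather than pattern-match on strings. The sketch only shows the shape of the test.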
source .venv/bin/activate
# Preview what would be undone (dry run)
python scripts/undo_merge.py \
--merged-dir data/merge_yaml/merged_2026 \
--normalized-dir data/normalized_yaml \
--filter variant_contamination \
--quality-report reports/merge_quality.yaml \
--dry-run
# Actually undo (remove --dry-run)
python scripts/undo_merge.py \
--merged-dir data/merge_yaml/merged_2026 \
--normalized-dir data/normalized_yaml \
--filter variant_contamination \
--quality-report reports/merge_quality.yaml \
--output-dir data/merge_yaml/restored
source .venv/bin/activate
python scripts/monitor_merges.py \
--merged-dir data/merge_yaml/merged_2026 \
--quality-report reports/merge_quality.yaml \
--output reports/merge_dashboard.yaml
# View dashboard
cat reports/merge_dashboard.yaml
Output: Comprehensive quality dashboard with actionable recommendations.
- data/normalized_yaml/ - 15,431 normalized recipes (organized by category)
- data/merge_yaml/merged_2026/ - New baseline merge (running now)
- data/merge_yaml/merged/ - Previous merge (Feb 2026, 1,350 recipes)
- data/merge_yaml/merge_stats_2026.json - New baseline statistics
- reports/fingerprint_comparison.yaml - Fingerprint mode analysis
- reports/mode_comparison.yaml - Merge mode comparison
- reports/merge_quality.yaml - Quality validation results
- reports/merge_dashboard.yaml - Monitoring dashboard
- src/culturemech/merge/hierarchy_fingerprint.py - Hierarchy-aware fingerprinting
- src/culturemech/merge/merge_rules.py - Merge rule engine
- src/culturemech/merge/fingerprint.py - Original fingerprinter (baseline)
- src/culturemech/merge/merger.py - Recipe merger
- src/culturemech/merge/merge_recipes.py - Main CLI script
All scripts support --limit to test on smaller datasets:
# Test fingerprint comparison on 1000 recipes
python scripts/compare_fingerprints.py \
--mim-repo /path/to/MediaIngredientMech \
--limit 1000
# Test merge modes on 1000 recipes
python scripts/test_merge_modes.py \
--mim-repo /path/to/MediaIngredientMech \
--modes all \
--limit 1000
# Test quality validation on 500 recipes
python scripts/validate_merge_quality.py \
--merged-dir data/merge_yaml/merged_2026 \
--mim-repo /path/to/MediaIngredientMech \
--limit 500
The baseline merge is running in the background. To check status:
# Check if output directory exists
ls -la data/merge_yaml/merged_2026/
# Count recipes processed so far
find data/merge_yaml/merged_2026 -name "*.yaml" | wc -l
# Check statistics file
cat data/merge_yaml/merge_stats_2026.json 2>/dev/null || echo "Not ready yet"
Expected completion: 2-5 minutes for 15,431 recipes
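The same status check can be scripted in Python, which is convenient when polling from a notebook or another script. This is a minimal sketch: `merge_progress` is a hypothetical helper, and it assumes the presence of the statistics file marks the end of the run.

```python
import json
from pathlib import Path


def merge_progress(merged_dir: str, stats_file: str) -> dict:
    """Report how many merged YAML files exist and whether the stats file is ready."""
    merged = Path(merged_dir)
    stats = Path(stats_file)
    # Recursive count, equivalent to: find <dir> -name "*.yaml" | wc -l
    count = sum(1 for _ in merged.rglob("*.yaml")) if merged.is_dir() else 0
    result = {"yaml_files": count, "stats_ready": stats.is_file(), "stats": None}
    if result["stats_ready"]:
        try:
            result["stats"] = json.loads(stats.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            # File exists but may still be mid-write; treat as not ready.
            result["stats_ready"] = False
    return result


if __name__ == "__main__":
    print(merge_progress("data/merge_yaml/merged_2026",
                         "data/merge_yaml/merge_stats_2026.json"))
```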
conservative:
- Strategy: Only merge with explicit rules or exact fingerprint match
- Use case: Maximum safety, preserve all distinctions
- Expected reduction: ~85%
aggressive:
- Strategy: Merge all variants with same parent ingredient
- Use case: Maximum deduplication, ignore variant differences
- Expected reduction: ~92%
variant-aware:
- Strategy: Merge hydration variants only (CaCl₂·2H₂O = CaCl₂), preserve other distinctions
- Use case: Balanced approach: good deduplication that preserves important chemistry
- Expected reduction: ~88%
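The three strategies above correspond to the `conservative`, `aggressive`, and `variant-aware` values passed to `--modes`. One way to picture the difference is as three ways of normalizing an ingredient before fingerprinting a recipe. The sketch below is an assumption-laden illustration, not the code in `hierarchy_fingerprint.py` or `merge_rules.py`. The `(name, parent)` record shape, the hydrate regex, and the function names are hypothetical, and explicit merge rules and concentrations are ignored.

```python
import hashlib
import re

# Assumed notation: a trailing hydrate suffix such as "·2H₂O".
HYDRATE_SUFFIX = re.compile(r"\s*[·.]\s*\d*\s*H[2₂]O$")


def ingredient_key(name: str, parent: str, mode: str) -> str:
    """Normalize one ingredient according to the merge mode (illustrative only)."""
    if mode == "conservative":
        return name                          # exact match only
    if mode == "aggressive":
        return parent                        # collapse every variant onto its parent
    if mode == "variant-aware":
        return HYDRATE_SUFFIX.sub("", name)  # collapse hydration variants only
    raise ValueError(f"unknown mode: {mode}")


def recipe_fingerprint(ingredients: list[tuple[str, str]], mode: str) -> str:
    """Hash the sorted set of normalized ingredient keys."""
    keys = sorted({ingredient_key(n, p, mode) for n, p in ingredients})
    return hashlib.sha256("|".join(keys).encode("utf-8")).hexdigest()


# Hypothetical recipes as (name, parent) pairs.
a = [("CaCl₂·2H₂O", "calcium chloride"), ("D-glucose", "glucose")]
b = [("CaCl₂", "calcium chloride"), ("D-glucose", "glucose")]
c = [("CaCl₂", "calcium chloride"), ("L-glucose", "glucose")]

for mode in ("conservative", "aggressive", "variant-aware"):
    groups = {recipe_fingerprint(r, mode) for r in (a, b, c)}
    print(mode, len(groups))  # conservative 3, aggressive 1, variant-aware 2
```

In this toy data the conservative key keeps all three recipes apart, the aggressive key folds them into one group, and the variant-aware key merges only the two that differ by hydration. That ordering matches the expected reduction percentages listed for each mode.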
Step 1: Wait for baseline merge to complete
# Check if complete
ls data/merge_yaml/merge_stats_2026.json
Step 2: Validate baseline quality
source .venv/bin/activate
python scripts/verify_merges.py \
--normalized-dir data/normalized_yaml \
--merged-dir data/merge_yaml/merged_2026 \
--stats-file data/merge_yaml/merge_stats_2026.json
Step 3: Test hierarchy-aware features (requires MediaIngredientMech)
# Compare fingerprint modes
python scripts/compare_fingerprints.py \
--mim-repo /path/to/MediaIngredientMech \
--limit 5000
# Test merge modes
python scripts/test_merge_modes.py \
--mim-repo /path/to/MediaIngredientMech \
--modes all \
--limit 5000
# Validate quality
python scripts/validate_merge_quality.py \
--merged-dir data/merge_yaml/merged_2026 \
--mim-repo /path/to/MediaIngredientMech
Step 4: Review results and choose mode
# Compare reports
ls -lh reports/
# View key metrics
grep -A5 "summary" reports/mode_comparison.yaml
grep -A5 "summary" reports/merge_quality.yaml
Step 5: Deploy chosen mode (see full implementation guide)
Solution: Activate virtual environment first
source .venv/bin/activate
Solution: Clone the repository
git clone https://github.com/microbiomedata/MediaIngredientMech.git
# Then use --mim-repo /path/to/MediaIngredientMech
Solution: Use --limit to test on smaller subsets
python scripts/compare_fingerprints.py --limit 1000 ...
Solution: Use rollback script with dry-run first
python scripts/undo_merge.py --dry-run --recipe-id "CM:123456" ...
- Full Implementation Guide: HIERARCHY_MERGE_IMPLEMENTATION.md
- Hierarchy Integration: HIERARCHY_INTEGRATION_SUMMARY.md
- Original Plan: See plan mode transcript
- Script Help: Run any script with --help
- ✅ Wait for baseline merge to complete (~2-5 min)
- ⏳ Validate baseline with verify_merges.py
- ⏳ Get MediaIngredientMech repository
- ⏳ Test fingerprint modes (requires MediaIngredientMech)
- ⏳ Compare merge modes (requires MediaIngredientMech)
- ⏳ Choose production mode based on results
Status check:
# Is baseline merge complete?
ls -lh data/merge_yaml/merge_stats_2026.json
# How many recipes merged?
find data/merge_yaml/merged_2026 -name "*.yaml" | wc -l
Questions? Check the full implementation guide or script --help output.