✅ Mapping Enhancements Successfully Integrated!

Date: October 28, 2025

Status: Integration Complete - Ready to Use


What Changed

The mapping enhancement workflow is now integrated into your Makefile pipeline, so the enhancements run automatically as part of the main pipeline (make all).

New Pipeline Stage: 10.5

Added between Stage 10 (kg-merge-mappings) and Stage 11 (compute-properties):

Stage 10:   Merge mappings → high_confidence_compound_mappings.tsv (56% coverage)
           ↓
Stage 10.5: Enhancement Pipeline (NEW!)
   10.5a → CAS-to-ChEBI upgrade        → high_confidence_compound_mappings_upgraded.tsv
   10.5b → Formula matching            → high_confidence_compound_mappings_formula_enhanced.tsv
   10.5c → Microbio products           → high_confidence_compound_mappings_final.tsv (72% coverage)
           ↓
Stage 11:   Property calculation (now uses 72% coverage file)
Stage 12:   Media summary (now uses 72% coverage file)

How to Use

Option 1: Run Complete Pipeline (Recommended)

make all

This now automatically runs all enhancement stages and uses the improved 72% coverage mappings.

Option 2: Run Just Enhancements

If you already have high_confidence_compound_mappings.tsv and want to enhance it:

make kg-enhance-all

This runs all 3 enhancement stages in sequence.

Option 3: Run Individual Enhancement Stages

# Run CAS-to-ChEBI upgrade only
make kg-enhance-cas-upgrade

# Run formula matching only
make kg-enhance-formula-matching

# Run microbiology products only
make kg-enhance-microbio-products

Option 4: Check Status

make status

Shows which enhancement files have been created:

Enhanced Mapping Files (Stage 10.5):
✓ CAS upgraded: 17658 entries
✓ Formula enhanced: 17658 entries
✓ Final enhanced (72% coverage): 17658 entries, 754/1047 unique compounds

What Gets Enhanced

Downstream Stages Now Use 72% Coverage

Property Calculation (make compute-properties):

  • Now uses: high_confidence_compound_mappings_final.tsv
  • Previously used: high_confidence_compound_mappings.tsv
  • Benefit: More compounds have ChEBI IDs → better pKa lookups → more accurate pH calculations

Media Summary (make media-summary):

  • Now uses: high_confidence_compound_mappings_final.tsv
  • Previously used: high_confidence_compound_mappings.tsv
  • Benefit: Richer semantic annotations in final summary

Expected Results

When you run make kg-enhance-all (or make all):

  1. CAS-to-ChEBI Upgrade (~30 seconds):

    • Processes: 3,328 CAS-RN entries
    • Upgrades: 1,628 to ChEBI (48.9% success rate)
    • Coverage: 56% → 65% (+9 percentage points)
  2. Formula Matching (~1 minute):

    • Processes: 1,272 compounds
    • Matches: 936 hydrated formulas (73.6% success rate)
    • Coverage: 65% → 70% (+5 percentage points)
  3. Microbiology Products (~5 seconds):

    • Processes: 1,205 compounds
    • Matches: 178 biological products
    • Coverage: 70% → 72% (+2 percentage points)

Total Time: ~2 minutes

Total Improvement: +16 percentage points of coverage (56% → 72%)
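
The success rates and coverage figures quoted in this document can be re-derived from the counts it reports. The sketch below does that arithmetic with a small helper; `pct` is a hypothetical name used only for illustration and is not part of the pipeline.

```shell
# Hypothetical helper (not part of the pipeline): print num/den as a
# percentage with one decimal place.
pct() {
  awk -v n="$1" -v d="$2" 'BEGIN { printf "%.1f\n", 100 * n / d }'
}

pct 1628 3328   # CAS-to-ChEBI upgrade success rate       -> 48.9
pct 936 1272    # formula matching success rate           -> 73.6
pct 587 1047    # coverage before enhancement (587/1047)  -> 56.1
pct 754 1047    # coverage after enhancement (754/1047)   -> 72.0
```

The difference between the last two values is the roughly 16-point coverage gain, which corresponds to 754 - 587 = 167 additional compounds with ChEBI IDs.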


Output Files

All enhanced mapping files are created in pipeline_output/merge_mappings/:

high_confidence_compound_mappings.tsv              # Original (56% coverage)
high_confidence_compound_mappings_upgraded.tsv     # After CAS upgrade (65%)
high_confidence_compound_mappings_formula_enhanced.tsv  # After formula matching (70%)
high_confidence_compound_mappings_final.tsv        # After all enhancements (72%) ← USED BY PIPELINE

Verification

Check Enhancement Ran Successfully

# Check files exist
ls -lh pipeline_output/merge_mappings/high_confidence_compound_mappings_*.tsv

# Check coverage
make status | grep "Enhanced Mapping"

# Count unique compounds with ChEBI IDs in the final file
awk -F'\t' 'NR>1 && $3 ~ /^CHEBI:/ {print $2}' \
  pipeline_output/merge_mappings/high_confidence_compound_mappings_final.tsv | \
  sort -u | wc -l
# Should show: 754 unique compounds
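
The checks above can also be bundled into one function. This is a minimal sketch, not part of the pipeline: `check_enhanced_files` is a hypothetical name, and the row-count comparison assumes that the enhancement stages rewrite IDs in place without adding or dropping rows (the sample make status output, which shows 17658 entries for every stage, suggests this but does not guarantee it).

```shell
# Hypothetical helper (not part of the pipeline): confirm that the three
# enhanced mapping files exist, are non-empty, and have matching row counts.
check_enhanced_files() {
  dir=${1:-pipeline_output/merge_mappings}
  first_rows=""
  for stage in upgraded formula_enhanced final; do
    f="$dir/high_confidence_compound_mappings_${stage}.tsv"
    if [ ! -s "$f" ]; then
      echo "missing or empty: $f" >&2
      return 1
    fi
    # Data rows = newline count minus the header line.
    rows=$(( $(wc -l < "$f") - 1 ))
    echo "$stage: $rows entries"
    if [ -z "$first_rows" ]; then
      first_rows=$rows
    elif [ "$rows" -ne "$first_rows" ]; then
      # Assumption: every stage keeps the same number of rows.
      echo "row count differs from the first stage: $f" >&2
      return 1
    fi
  done
}
```

Call it as `check_enhanced_files` from the repository root, or pass a different directory as the first argument. A non-zero exit status means a file is missing or the counts disagree.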

Compare Before/After

# Original file ChEBI count
awk -F'\t' 'NR>1 && $3 ~ /^CHEBI:/ {print $2}' \
  pipeline_output/merge_mappings/high_confidence_compound_mappings.tsv | \
  sort -u | wc -l
# Shows: 587 (56%)

# Enhanced file ChEBI count
awk -F'\t' 'NR>1 && $3 ~ /^CHEBI:/ {print $2}' \
  pipeline_output/merge_mappings/high_confidence_compound_mappings_final.tsv | \
  sort -u | wc -l
# Shows: 754 (72%)

# Improvement: +167 compounds with ChEBI IDs
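
To get the count and the percentage in one pass, the awk commands above can be folded into a single function. This is a sketch under two assumptions: the column layout matches the commands above (compound in column 2, mapped ID in column 3, one header row), and the coverage denominator is the number of unique compounds in the file itself. `chebi_coverage` is a hypothetical name, not an existing pipeline target.

```shell
# Hypothetical helper (not part of the pipeline): report how many unique
# compounds in a mapping TSV have at least one ChEBI ID.
# Assumes column 2 = compound, column 3 = mapped ID, row 1 = header.
chebi_coverage() {
  awk -F'\t' '
    NR > 1 {
      seen[$2] = 1                          # every compound in the file
      if ($3 ~ /^CHEBI:/) mapped[$2] = 1    # compounds with a ChEBI ID
    }
    END {
      total = 0; hits = 0
      for (c in seen) total++
      for (c in mapped) hits++
      pct = (total > 0) ? 100 * hits / total : 0
      printf "%d/%d unique compounds (%.1f%%)\n", hits, total, pct
    }' "$1"
}
```

Running it on the original and the final file, for example `chebi_coverage pipeline_output/merge_mappings/high_confidence_compound_mappings_final.tsv`, gives a side-by-side comparison. If the denominator assumption does not hold for your data, the percentages will differ from the ones make status reports.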

Configuration

The ChEBI nodes file path is configured in the Makefile:

CHEBI_NODES_FILE := /Users/marcin/Documents/VIMSS/ontology/KG-Hub/KG-Microbe/kg-microbe/data/transformed/ontologies/chebi_nodes.tsv

If your ChEBI file is in a different location, edit this line in the Makefile.
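
As an alternative to editing the Makefile, GNU make lets a variable given on the command line take precedence over an ordinary assignment such as the `:=` line above (unless the Makefile marks it with `override`). The sketch below wraps that in a function that first checks the file is readable. `run_enhancements_with_chebi` is a hypothetical name and any path passed to it is only an example.

```shell
# Hypothetical wrapper (not part of the pipeline): run the enhancement
# stages against a ChEBI nodes file supplied on the command line.
run_enhancements_with_chebi() {
  chebi_file=$1
  if [ ! -r "$chebi_file" ]; then
    echo "ChEBI nodes file not readable: $chebi_file" >&2
    return 1
  fi
  # A command-line variable overrides the Makefile's := assignment.
  make kg-enhance-all CHEBI_NODES_FILE="$chebi_file"
}
```

For a one-off run the same effect comes from calling make directly with `CHEBI_NODES_FILE=` followed by the path to your chebi_nodes.tsv. Editing the Makefile remains the better choice when the new location is permanent.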


Troubleshooting

Enhancement Stages Skip

If make kg-enhance-all reports that the files are up to date:

# Force re-run by removing intermediate files (-f: no error if a file is already gone)
rm -f pipeline_output/merge_mappings/high_confidence_compound_mappings_upgraded.tsv
rm -f pipeline_output/merge_mappings/high_confidence_compound_mappings_formula_enhanced.tsv
rm -f pipeline_output/merge_mappings/high_confidence_compound_mappings_final.tsv

# Run again
make kg-enhance-all

ChEBI File Not Found

If you get an error saying that chebi_nodes.tsv was not found:

  1. Update the CHEBI_NODES_FILE variable in the Makefile
  2. Or download ChEBI: https://www.ebi.ac.uk/chebi/downloadsForward.do

Enhancement Not Used by Pipeline

If property calculation still uses the old file:

# Remove property calculation cache
rm -rf pipeline_output/property_calculation/media_properties/.done

# Re-run with enhanced mappings
make compute-properties

Help

For full list of targets:

make help

Look for:

  • Step 10.5: Enhancement stages
  • Step 11: Property calculation (now with 72% coverage note)
  • Step 12: Media summary (now with 72% coverage note)

Summary

✅ Integration Complete
✅ Enhancements Run Automatically with make all
✅ 72% ChEBI Coverage achieved
✅ Property Calculations now use enhanced mappings
✅ Media Summary now uses enhanced mappings

Next Step: Run make kg-enhance-all or make all to use the improvements!


Documentation:

  • Full deployment details: DEPLOYMENT_REPORT.md
  • Implementation guide: IMPLEMENTATION_SUMMARY.md
  • Project summary: FINAL_SUMMARY.md
  • Architecture: CLAUDE.md