Date: October 28, 2025
Status: Integration Complete - Ready to Use
The mapping enhancement workflow has been fully integrated into your Makefile pipeline. The enhancements now run automatically as part of the main pipeline.
Stage 10.5 was added between Stage 10 (kg-merge-mappings) and Stage 11 (compute-properties):
Stage 10: Merge mappings → high_confidence_compound_mappings.tsv (56% coverage)
↓
Stage 10.5: Enhancement Pipeline (NEW!)
10.5a → CAS-to-ChEBI upgrade → high_confidence_compound_mappings_upgraded.tsv
10.5b → Formula matching → high_confidence_compound_mappings_formula_enhanced.tsv
10.5c → Microbio products → high_confidence_compound_mappings_final.tsv (72% coverage)
↓
Stage 11: Property calculation (now uses 72% coverage file)
Stage 12: Media summary (now uses 72% coverage file)
make all
This now automatically runs all enhancement stages and uses the improved 72% coverage mappings.
If you already have high_confidence_compound_mappings.tsv and want to enhance it:
make kg-enhance-all
This runs all 3 enhancement stages in sequence.
# Run CAS-to-ChEBI upgrade only
make kg-enhance-cas-upgrade
# Run formula matching only
make kg-enhance-formula-matching
# Run microbiology products only
make kg-enhance-microbio-products

make status
Shows which enhancement files have been created:
Enhanced Mapping Files (Stage 10.5):
✓ CAS upgraded: 17658 entries
✓ Formula enhanced: 17658 entries
✓ Final enhanced (72% coverage): 17658 entries, 754/1047 unique compounds
Property Calculation (make compute-properties):
- Now uses: high_confidence_compound_mappings_final.tsv
- Previously used: high_confidence_compound_mappings.tsv
- Benefit: More compounds have ChEBI IDs → better pKa lookups → more accurate pH calculations
Media Summary (make media-summary):
- Now uses: high_confidence_compound_mappings_final.tsv
- Previously used: high_confidence_compound_mappings.tsv
- Benefit: Richer semantic annotations in final summary
When you run make kg-enhance-all (or make all):
- CAS-to-ChEBI Upgrade (~30 seconds):
  - Processes: 3,328 CAS-RN entries
  - Upgrades: 1,628 to ChEBI (48.9% success rate)
  - Coverage: 56% → 65% (+9%)
- Formula Matching (~1 minute):
  - Processes: 1,272 compounds
  - Matches: 936 hydrated formulas (73.6% success rate)
  - Coverage: 65% → 70% (+5%)
- Microbiology Products (~5 seconds):
  - Processes: 1,205 compounds
  - Matches: 178 biological products
  - Coverage: 70% → 72% (+2%)
Total Time: ~2 minutes
Total Improvement: +16% coverage (56% → 72%)
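The headline coverage figures follow directly from the unique-compound counts reported in this document (587 and 754 compounds with ChEBI IDs, out of 1047). The sketch below just reproduces that arithmetic; `coverage_pct` is a hypothetical helper name and not part of the pipeline.

```shell
# coverage_pct MAPPED TOTAL
# Prints the share of compounds with a ChEBI ID as a percentage (one decimal).
# Hypothetical helper for illustration; not a Makefile target.
coverage_pct() {
    awk -v mapped="$1" -v total="$2" 'BEGIN {
        if (total <= 0) { print "total must be positive" > "/dev/stderr"; exit 1 }
        printf "%.1f\n", 100 * mapped / total
    }'
}

before=$(coverage_pct 587 1047)   # original mappings
after=$(coverage_pct 754 1047)    # after all three enhancement stages
echo "Coverage: ${before}% -> ${after}% (+$((754 - 587)) compounds)"
# prints: Coverage: 56.1% -> 72.0% (+167 compounds)
```

The rounded values match the 56% and 72% quoted throughout, and the difference of 167 compounds matches the improvement figure in the verification commands.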
All enhanced mapping files are created in pipeline_output/merge_mappings/:
high_confidence_compound_mappings.tsv # Original (56% coverage)
high_confidence_compound_mappings_upgraded.tsv # After CAS upgrade (65%)
high_confidence_compound_mappings_formula_enhanced.tsv # After formula matching (70%)
high_confidence_compound_mappings_final.tsv # After all enhancements (72%) ← USED BY PIPELINE
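To see how each stage changes the mappings, one option is to summarise all four files in one pass. This is a minimal sketch, assuming the column layout implied by the awk commands in this document (compound in column 2, mapped ID in column 3, one header row); `stage_summary` is a hypothetical helper name.

```shell
# stage_summary DIR
# For each stage file in DIR, print tab-separated: data rows (header excluded),
# unique compounds whose column-3 ID starts with "CHEBI:", and the file name.
# Files that do not exist yet are reported as "missing".
stage_summary() {
    dir=$1
    for suffix in "" _upgraded _formula_enhanced _final; do
        f="$dir/high_confidence_compound_mappings${suffix}.tsv"
        if [ ! -f "$f" ]; then
            printf 'missing\t-\t%s\n' "${f##*/}"
            continue
        fi
        rows=$(awk 'NR > 1' "$f" | wc -l | tr -d ' ')
        chebi=$(awk -F'\t' 'NR > 1 && $3 ~ /^CHEBI:/ { print $2 }' "$f" |
            sort -u | wc -l | tr -d ' ')
        printf '%s\t%s\t%s\n' "$rows" "$chebi" "${f##*/}"
    done
}

stage_summary pipeline_output/merge_mappings
```

If the stages behave as the status output suggests, the row count should stay the same across all four files while the ChEBI compound count rises from stage to stage.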
# Check files exist
ls -lh pipeline_output/merge_mappings/high_confidence_compound_mappings_*.tsv
# Check coverage
make status | grep "Enhanced Mapping"
# Count ChEBI entries in final file
awk -F'\t' 'NR>1 && $3 ~ /^CHEBI:/ {print $2}' \
pipeline_output/merge_mappings/high_confidence_compound_mappings_final.tsv | \
sort -u | wc -l
# Should show: 754 unique compounds

# Original file ChEBI count
awk -F'\t' 'NR>1 && $3 ~ /^CHEBI:/ {print $2}' \
pipeline_output/merge_mappings/high_confidence_compound_mappings.tsv | \
sort -u | wc -l
# Shows: 587 (56%)
# Enhanced file ChEBI count
awk -F'\t' 'NR>1 && $3 ~ /^CHEBI:/ {print $2}' \
pipeline_output/merge_mappings/high_confidence_compound_mappings_final.tsv | \
sort -u | wc -l
# Shows: 754 (72%)
# Improvement: +167 compounds with ChEBI IDs

The ChEBI nodes file path is configured in the Makefile:
CHEBI_NODES_FILE := /Users/marcin/Documents/VIMSS/ontology/KG-Hub/KG-Microbe/kg-microbe/data/transformed/ontologies/chebi_nodes.tsv
If your ChEBI file is in a different location, edit this line in the Makefile.
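As an alternative to editing the Makefile, a variable assigned on the make command line takes precedence over an ordinary assignment inside the Makefile (unless the Makefile uses the `override` directive). The sketch below wraps that in a pre-flight check; `run_enhance_with_chebi` and the `MAKE_CMD` hook are hypothetical names introduced here, and it assumes the enhancement targets read the path only through `CHEBI_NODES_FILE`.

```shell
# run_enhance_with_chebi PATH
# Verifies that PATH is a readable file, then runs the enhancement stages with
# CHEBI_NODES_FILE overridden for this invocation only.
# MAKE_CMD is a hypothetical hook (defaults to "make") that allows a dry run,
# e.g. MAKE_CMD=echo prints the arguments instead of running make.
run_enhance_with_chebi() {
    chebi_file=$1
    if [ ! -f "$chebi_file" ] || [ ! -r "$chebi_file" ]; then
        echo "chebi_nodes.tsv not readable: $chebi_file" >&2
        return 1
    fi
    "${MAKE_CMD:-make}" kg-enhance-all CHEBI_NODES_FILE="$chebi_file"
}

# Example (hypothetical path):
#   run_enhance_with_chebi "$HOME/data/chebi_nodes.tsv"
```

Checking the file first turns a missing path into an immediate, explicit error rather than a failure partway through the enhancement stages.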
If make kg-enhance-all says files are up-to-date:
# Force re-run by removing intermediate files
rm pipeline_output/merge_mappings/high_confidence_compound_mappings_upgraded.tsv
rm pipeline_output/merge_mappings/high_confidence_compound_mappings_formula_enhanced.tsv
rm pipeline_output/merge_mappings/high_confidence_compound_mappings_final.tsv
# Run again
make kg-enhance-all

If you get an error about chebi_nodes.tsv not found:
- Update the CHEBI_NODES_FILE variable in the Makefile
- Or download ChEBI: https://www.ebi.ac.uk/chebi/downloadsForward.do
If property calculation still uses the old file:
# Remove property calculation cache
rm -rf pipeline_output/property_calculation/media_properties/.done
# Re-run with enhanced mappings
make compute-properties

For the full list of targets:
make help
Look for:
- Step 10.5: Enhancement stages
- Step 11: Property calculation (now with 72% coverage note)
- Step 12: Media summary (now with 72% coverage note)
✅ Integration Complete
✅ Enhancements Run Automatically with make all
✅ 72% ChEBI Coverage achieved
✅ Property Calculations now use enhanced mappings
✅ Media Summary now uses enhanced mappings
Next Step: Run make kg-enhance-all or make all to use the improvements!
Documentation:
- Full deployment details: DEPLOYMENT_REPORT.md
- Implementation guide: IMPLEMENTATION_SUMMARY.md
- Project summary: FINAL_SUMMARY.md
- Architecture: CLAUDE.md