This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
MicroMediaParam is a bioinformatics pipeline for extracting, processing, and analyzing microbial growth media composition data from BacDive/MediaDive and DSMZ sources. It maps chemical compounds to knowledge graph entities (ChEBI, KEGG, PubChem) and computes physical-chemical properties (pH, salinity).
Dataset: 23,181 chemical entries from 1,807 microbial growth media with 72% ChEBI coverage.
# Environment (prefer uv)
uv sync # Install dependencies
uv pip install -e ".[dev]" # Install dev dependencies
# Code quality
black src/ *.py && isort src/ *.py # Format
flake8 src/ *.py && mypy src/ # Lint and type check
# Testing
make test # Full test suite
python -m pytest -v # Pytest only
# Pipeline execution
make all # Complete pipeline (30-60 min)
make status # Check pipeline status
make validate # Verify outputs

The Makefile orchestrates a 12-stage pipeline. Run `make help` for all targets.
| Stage | Target | Description |
|---|---|---|
| 1 | data-acquisition | Download PDFs/JSON from MediaDive/DSMZ |
| 2 | data-conversion | Convert PDFs to markdown, extract compositions |
| 3 | db-mapping | Build chemical properties DB (ingredient → pKa) |
| 4 | kg-mapping-initial | Initial ChEBI/KEGG/PubChem mapping |
| 5 | solution-expansion | Expand "solution:241" → individual chemicals |
| 6 | normalize-hydration-early | Normalize hydrates BEFORE matching |
| 7 | enhance-ingredients-early | Convert ingredient codes → ChEBI |
| 8-10 | kg-compound-matching | OAK ChEBI + fuzzy matching + merge |
| 10.5 | kg-enhance-all | CAS→ChEBI + formula + microbio products (+16%) |
| 11 | compute-properties | Calculate pH, salinity, ionic strength |
| 12 | media-summary | Generate final summary table |
- DB Mapping (ingredient → pKa, properties): For pH/salinity calculations
- KG Mapping (ingredient → ChEBI IDs): For semantic analysis
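
As an illustration of the kind of property the DB mapping feeds in stage 11, the sketch below computes ionic strength from the standard definition I = ½ Σ cᵢ zᵢ². It is not the pipeline's own implementation: the function name and the input shape (pairs of molar concentration and ion charge) are assumptions, and it assumes full dissociation of each salt.

```python
from typing import Iterable, Tuple


def ionic_strength(ions: Iterable[Tuple[float, int]]) -> float:
    """Return I = 0.5 * sum(c_i * z_i**2) for (mol/L, charge) pairs.

    Hypothetical helper: assumes each salt is fully dissociated and that
    concentrations are already expressed in mol/L.
    """
    return 0.5 * sum(conc * charge ** 2 for conc, charge in ions)


# 0.1 M NaCl dissociates into 0.1 M Na+ and 0.1 M Cl-.
nacl = [(0.1, +1), (0.1, -1)]

# 0.1 M CaCl2 dissociates into 0.1 M Ca2+ and 0.2 M Cl-.
cacl2 = [(0.1, +2), (0.2, -1)]

print(ionic_strength(nacl))
print(ionic_strength(cacl2))
```

The divalent calcium ion contributes four times as much per mole as a monovalent ion, which is why the CaCl2 solution has three times the ionic strength of NaCl at the same molarity.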
src/
├── scripts/ # Main pipeline scripts (parse, download, convert, map, compute)
├── mapping/ # KG mapping: unified_mapper.py, formula_matcher.py, cas_to_chebi_upgrader.py
├── hydration/ # Hydrate normalization: normalize_hydration_enhanced.py
├── analysis/ # Chemical analysis: extract_non_chebi_compounds.py
├── quality/ # Validation: calculate_molecular_weights.py, fix_*.py
├── tools/ # Utilities: complete_solution_expansion.py
├── chem/ # IUPAC and PubChem integration
└── attic/ # Legacy scripts (archived)
- `pipeline_output/kg_mapping/composition_kg_mapping.tsv` - Initial mappings
- `pipeline_output/solution_expansion/composition_kg_mapping_expanded_solutions.tsv` - After solution expansion
- `pipeline_output/merge_mappings/high_confidence_compound_mappings.tsv` - Final high-quality mappings
- `pipeline_output/merge_mappings/high_confidence_compound_mappings_final.tsv` - After all enhancements (72% ChEBI)
- `pipeline_output/db_mapping/chemical_properties.tsv` - pKa and molecular properties
Hydration normalization happens EARLY (Stage 6) to ensure consistent base compounds:
- "CaCl2 x 2 H2O" and "CaCl2 x 6 H2O" → same base ChEBI, different molecular weights
- Patterns: `6-hydrate`, `6H2O`, `x H2O`, `·6H2O`
- MW calculation: `MW_hydrated = MW_base + (n × 18.015)`
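
A minimal sketch of the idea, assuming a single regular expression is enough to cover the patterns listed above. The regex, the function names, and the choice to treat an unnumbered `x H2O` as one water are illustrative assumptions, not the behavior of `normalize_hydration_enhanced.py`.

```python
import re
from typing import Tuple

WATER_MW = 18.015  # g/mol, the value used in the MW formula above

# Hypothetical pattern: an optional separator (x, ·, *), an optional count,
# then "H2O" or "hydrate" at the end of the name.
_HYDRATE_RE = re.compile(
    r"\s*(?:[x·*]\s*)?(\d+)?\s*-?\s*(?:hydrate|H2O)\s*$",
    re.IGNORECASE,
)


def split_hydrate(name: str) -> Tuple[str, int]:
    """Return (base compound, number of waters); 0 waters if not a hydrate."""
    match = _HYDRATE_RE.search(name)
    if match is None:
        return name.strip(), 0
    # Assumption: a missing count (e.g. "x H2O") is treated as one water.
    n_waters = int(match.group(1)) if match.group(1) else 1
    return name[: match.start()].strip(), n_waters


def hydrated_mw(base_mw: float, n_waters: int) -> float:
    """MW_hydrated = MW_base + n * 18.015."""
    return base_mw + n_waters * WATER_MW


base, n = split_hydrate("CaCl2 x 2 H2O")
# 110.98 g/mol is the approximate molecular weight of anhydrous CaCl2.
print(base, n, round(hydrated_mw(110.98, n), 2))
```

Both `CaCl2 x 2 H2O` and `CaCl2·6H2O` reduce to the base name `CaCl2`, so they can share one ChEBI lookup while keeping distinct molecular weights.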
Solution references (e.g., "solution:241") are expanded into individual chemicals by:
- Downloading solution PDFs from DSMZ MediaDive REST API
- Parsing compositions with `src/tools/enhanced_solution_parser.py`
- Adjusting concentrations based on solution usage ratios
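
The concentration adjustment can be sketched as a simple dilution, assuming component concentrations are given in g/L of stock solution and the medium specifies how many mL of that solution go into one liter. The `Ingredient` class, the function name, and the example values are hypothetical; the real parser may use different units or fields.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Ingredient:
    name: str
    grams_per_liter: float


def expand_solution(
    components: List[Ingredient], solution_ml_per_liter: float
) -> List[Ingredient]:
    """Scale stock-solution concentrations to their final value in the medium.

    Assumes a plain volume ratio: final = stock * (mL used / 1000 mL).
    """
    ratio = solution_ml_per_liter / 1000.0
    return [
        Ingredient(c.name, c.grams_per_liter * ratio) for c in components
    ]


# Illustrative stock solution (made-up values, not an actual DSMZ solution).
stock = [Ingredient("compound A", 1.5), Ingredient("compound B", 0.2)]

# The medium uses 10 mL of the stock per liter.
for item in expand_solution(stock, solution_ml_per_liter=10.0):
    print(item.name, item.grams_per_liter)
```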
Three strategies raise ChEBI coverage by 16 percentage points (56% → 72%):
- `cas_to_chebi_upgrader.py` - CAS-RN → ChEBI (+94 compounds)
- `apply_formula_matching.py` - Hydrated formulas (+56 compounds)
- `apply_microbio_products.py` - Biological products (+21 semantic IDs)
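
To check a coverage figure like this on a mapping file, one option is to count the rows whose identifier starts with `CHEBI:`. The sketch below assumes a tab-separated file with a header row; the column name `mapped_id` and the sample rows are hypothetical, so adjust them to the actual TSV schema.

```python
import csv
import io


def chebi_coverage(tsv_text: str, id_column: str = "mapped_id") -> float:
    """Return the fraction of rows whose ID column starts with 'CHEBI:'."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    total = 0
    mapped = 0
    for row in reader:
        total += 1
        if (row.get(id_column) or "").startswith("CHEBI:"):
            mapped += 1
    return mapped / total if total else 0.0


# Toy input: three of four rows carry a ChEBI identifier.
sample = (
    "compound\tmapped_id\n"
    "water\tCHEBI:15377\n"
    "sodium chloride\tCHEBI:26710\n"
    "ingredient X\tingredient:123\n"
    "ingredient Y\tCHEBI:0000000\n"
)

print(f"{chebi_coverage(sample):.0%}")
```

For a real file, pass `Path(...).read_text()` from `pathlib` instead of the inline sample.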
Core components in src/mapping/:
- `compound_normalizer.py` - Name normalization (hydrates, stereochemistry)
- `matching_strategies.py` - Strategy pattern for different approaches
- `unified_mapper.py` - Main mapping engine
- `formula_matcher.py` - Hydrated chemical formulas
- `microbio_products.py` - Curated biological products dictionary
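
The strategy pattern mentioned for `matching_strategies.py` could look roughly like the sketch below: each strategy exposes the same `match` method, and the mapper tries them in order until one returns an ID. All class and function names here are hypothetical and the index is a toy; the real module's interfaces may differ.

```python
from typing import Dict, Optional, Protocol, Sequence


class MatchingStrategy(Protocol):
    """Anything with a match() method can be plugged into the mapper."""

    def match(self, name: str) -> Optional[str]: ...


class ExactMatch:
    def __init__(self, index: Dict[str, str]) -> None:
        self._index = index

    def match(self, name: str) -> Optional[str]:
        return self._index.get(name)


class CaseInsensitiveMatch:
    def __init__(self, index: Dict[str, str]) -> None:
        # Pre-compute a lowercase view of the index once.
        self._index = {key.lower(): value for key, value in index.items()}

    def match(self, name: str) -> Optional[str]:
        return self._index.get(name.strip().lower())


def map_compound(
    name: str, strategies: Sequence[MatchingStrategy]
) -> Optional[str]:
    """Return the first ID produced by any strategy, or None."""
    for strategy in strategies:
        hit = strategy.match(name)
        if hit is not None:
            return hit
    return None


# Toy index for illustration only.
index = {"water": "CHEBI:15377", "sodium chloride": "CHEBI:26710"}
strategies = [ExactMatch(index), CaseInsensitiveMatch(index)]

print(map_compound("Sodium Chloride ", strategies))
print(map_compound("unknown broth", strategies))
```

Ordering the strategies from strictest to loosest keeps high-confidence matches from being overridden by fuzzier ones.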
Additional unmapped chemicals from data/unmapped/bacdive_metabolites_without_chebi_ids.tsv:
- 19,129 records → 154 unique metabolites (optimized for processing)
- Top metabolites: Potassium 5-ketogluconate (7,610), Potassium 2-ketogluconate (6,705), casein (1,585)
- Pipeline: `make bacdive-metabolites-mapping` (extract → OAK annotate → apply)
- Output: `pipeline_output/bacdive_metabolites/bacdive_metabolites_chebi_mappings.tsv`
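
Collapsing many records into a short list of unique names, ranked by frequency, can be sketched with `collections.Counter`. This is an illustration of the deduplication idea only: the function name and the toy records are assumptions, and the actual extraction step may normalize names differently.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def unique_metabolites(records: Iterable[str]) -> List[Tuple[str, int]]:
    """Return (name, record count) pairs, most frequent first.

    Names are stripped of surrounding whitespace; empty names are skipped.
    """
    counts = Counter(name.strip() for name in records if name.strip())
    return counts.most_common()


# Toy records for illustration (not taken from the real TSV).
records = [
    "casein",
    "Potassium 5-ketogluconate",
    "casein ",
    "Potassium 5-ketogluconate",
    "Potassium 5-ketogluconate",
    "",
]

for name, count in unique_metabolites(records):
    print(name, count)
```

Annotating each unique name once and then applying the result back to every record avoids repeating the same lookup thousands of times.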
# Debug a specific stage
make status # See what's present
make solution-expansion # Run just that stage
cat *.log # Check logs
# Work with OAK ChEBI mappings
make oak-chebi-status # Check status
make kg-oak-chebi-mapping # Run full OAK pipeline
# Map BacDive metabolites (19k records → 154 unique)
make bacdive-metabolites-status # Check status
make bacdive-metabolites-mapping # Full pipeline
# Quick iteration
make quick # Smaller dataset for testing
# Chemical database updates
make iupac-status && make pubchem-status # Check database status
make db-mapping # Rebuild properties DB
# Run individual mapping enhancements
python3 src/mapping/cas_to_chebi_upgrader.py --chebi-file chebi_nodes.tsv --input input.tsv --output output.tsv
python3 -m src.mapping.unified_mapper --kg-nodes merged-kg_nodes.tsv --composition-dir media_compositions

| Issue | Solution |
|---|---|
| `make validate` fails | Run `make status` to see what's missing; pipeline creates files in `pipeline_output/kg_mapping/`, not separate directories |
| Missing ChEBI matches | File only created if unmapped compounds exist; check `make oak-chebi-status` |
| Property calculation fails | Verify `chemical_properties.tsv` exists; check `compute_media_properties.log` |
| Stage fails | Check previous stage completed; review stage log; run `make status` |
- Python 3.10+, Black (88 chars), isort (black profile)
- Type hints required (`mypy --disallow-untyped-defs`)
- Async I/O: `aiohttp`/`aiofiles` for downloads
- Document processing: MarkItDown for PDFs, BeautifulSoup4 for HTML