MolecularIQ Benchmark Dataset

A comprehensive benchmark for evaluating large language models on molecular reasoning tasks

License: MIT · Python 3.9+ · RDKit

Count, index, and constraint generation questions across diverse chemical features

(Figure: MolecularIQ benchmark statistics — see assets/moleculariq_statistics.png)


🎯 Overview

MolecularIQ is a benchmark specifically designed to measure the structural reasoning abilities of large language models on molecules. Unlike many chemistry evaluation sets that rely on literature labels or surrogate predictors, MolecularIQ focuses only on tasks whose correctness can be verified algorithmically from the molecular graph itself. This makes it possible to distinguish genuine structural understanding from memorization or correlation-based answers.
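As a toy illustration of what "verifiable from the molecular graph" means (this is not the benchmark's actual verifier), a count or index question's ground truth can be recomputed directly from a graph representation:

```python
# Toy molecular graph for ethanol (SMILES: CCO): atom symbols plus bonds
# given as index pairs. Illustrative stand-in, not the benchmark code.
atoms = ["C", "C", "O"]
bonds = [(0, 1), (1, 2)]

def count_element(atoms, symbol):
    # "Count" questions: the answer is a simple traversal of the graph.
    return sum(a == symbol for a in atoms)

def heavy_atom_degree(bonds, idx):
    # "Index" questions (e.g. the degree of atom 1) are equally checkable.
    return sum(idx in bond for bond in bonds)
```

Because the answer is a deterministic function of the graph, a model's response can be marked right or wrong without any literature label or surrogate predictor.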

📁 Repository Structure

moleculariq-benchmark/
├── 📂 src/                                # Source code
│   ├── 📂 a_dataset_pools/               # Stage A: Dataset pool creation
│   │   ├── 1_collect_pubchem_data.py
│   │   ├── 2_collect_external_test_set_molecules.py
│   │   ├── 3_standardize_pubchem_mols_and_remove_external_test_mols.py
│   │   ├── 4_create_train_test_pools.py
│   │   ├── 5_create_hard_test_pool_dataframe.py
│   │   └── utils/                        # External test set utilities
│   └── 📂 b_benchmark/                   # Stage B: Benchmark generation
│       ├── 1_compute_properties.py       # Compute ground truth properties
│       ├── 2_create_benchmark.py         # Generate final benchmark dataset
│       ├── task_names.py                 # Task name definitions
│       └── benchmark_generator/          # Generation logic (uses moleculariq-core)
│           ├── main.py                   # CLI entry point
│           ├── config.py                 # Configuration
│           ├── tasks/                    # Task generators (count, index, constraint)
│           ├── core/                     # Sampling, scoring, validation
│           └── output/                   # JSON & HuggingFace export
├── 📂 data/                              # Data artifacts (not tracked)
│   ├── dataset_pools/                    # Molecule pools
│   │   ├── external/                     # External benchmark molecules
│   │   ├── intermediate/                 # Pipeline intermediates
│   │   ├── processed/                    # Processed datasets
│   │   ├── pseudo_sdf/                   # Sample SDF for testing
│   │   └── pubchem_raw_sdf/              # Raw PubChem SDF files
│   └── benchmark/                        # Generated benchmark data
│       ├── properties.pkl                # Precomputed molecular properties
│       └── benchmark_dataset.json        # Final benchmark dataset
├── 📓 notebooks/                         # Analysis notebooks
│   └── overview_created_data.ipynb       # Data creation walkthrough
└── 📊 assets/                            # Documentation assets
    └── moleculariq_statistics.png

🔄 Data Creation Pipeline

Stage A: Dataset Pool Creation

1. Collect PubChem Data (`1_collect_pubchem_data.py`)

  • Extract SMILES and IUPAC names from PubChem SDF files
  • Filter molecules (carbon-containing, single-fragment)
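A minimal sketch of such a pre-filter, using only string-level SMILES checks (the actual script presumably parses molecules with RDKit instead):

```python
def passes_prefilter(smiles: str) -> bool:
    """Keep only single-fragment, carbon-containing SMILES.

    Crude string-level sketch; a robust version would use RDKit parsing.
    """
    # Disconnected fragments are written with a '.' separator in SMILES.
    if "." in smiles:
        return False
    # Drop 'Cl' so its 'C' is not mistaken for a carbon atom. (Bracket
    # atoms such as [Ca] would still fool this check; RDKit would not.)
    stripped = smiles.replace("Cl", "")
    return "C" in stripped or "c" in stripped
```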

2. Collect External Test Sets (`2_collect_external_test_set_molecules.py`)

  • Aggregate molecules from LLaSMol, ChemDFM, Ether0, ChemIQ benchmarks

3. Standardize and Filter (`3_standardize_pubchem_mols_and_remove_external_test_mols.py`)

  • Canonicalize SMILES
  • Remove molecules present in external benchmarks
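The leakage-removal step amounts to a set lookup keyed on canonical SMILES. In the sketch below, `canonicalize` stands in for RDKit canonicalization (e.g. `Chem.CanonSmiles`); the identity default keeps the example dependency-free:

```python
def remove_external(pool, external, canonicalize=lambda s: s):
    """Drop pool molecules whose canonical form appears in external test sets.

    `canonicalize` would be RDKit canonicalization in the real pipeline;
    here it defaults to identity so the sketch runs standalone.
    """
    external_keys = {canonicalize(s) for s in external}
    return [s for s in pool if canonicalize(s) not in external_keys]
```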

4. Create Train/Test Pools (`4_create_train_test_pools.py`)

  • Cluster molecules using MinHash LSH on Morgan fingerprints
  • Split into: Training pool, Easy test set, Hard test set
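The clustering idea can be illustrated with a self-contained MinHash sketch over fingerprint on-bits. Plain integer sets stand in for Morgan fingerprint bits here; the pipeline's actual parameters and LSH banding are not shown:

```python
import random

def minhash_signature(features, num_perm=64, seed=0):
    """MinHash signature of a set of fingerprint on-bits (toy sketch)."""
    rng = random.Random(seed)
    prime = (1 << 61) - 1
    # One universal hash (a*x + b mod p) per simulated permutation.
    params = [(rng.randrange(1, prime), rng.randrange(prime))
              for _ in range(num_perm)]
    return [min((a * f + b) % prime for f in features) for a, b in params]

def estimated_jaccard(sig_a, sig_b):
    # The fraction of matching signature slots approximates Jaccard
    # similarity; LSH banding over such signatures groups similar
    # molecules into clusters, which are then assigned to splits whole.
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
```

Assigning entire clusters to a single split keeps near-duplicate molecules from straddling the train/test boundary.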

5. Create Hard Test Pool DataFrame (`5_create_hard_test_pool_dataframe.py`)

  • Build structured dataframe with molecular complexity metrics

Stage B: Benchmark Generation

1. Compute Properties (`1_compute_properties.py`)

  • Calculate ground truth values for all molecular properties
  • Uses SymbolicSolver from moleculariq-core for accurate computation
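Conceptually, the precomputation step builds a per-molecule property table and caches it to disk. The sketch below uses a hypothetical stand-in property function; in the real pipeline the callables would come from moleculariq-core:

```python
import os
import pickle
import tempfile

def precompute_properties(smiles_list, property_fns):
    """Map each SMILES to its ground-truth property values.

    `property_fns` maps a property name to a callable; these are
    hypothetical stand-ins for the actual symbolic solvers.
    """
    return {smi: {name: fn(smi) for name, fn in property_fns.items()}
            for smi in smiles_list}

# Stand-in property for the sketch: SMILES string length.
table = precompute_properties(["CCO", "c1ccccc1"], {"smiles_len": len})

# Cache to disk, analogous to data/benchmark/properties.pkl.
path = os.path.join(tempfile.gettempdir(), "properties_demo.pkl")
with open(path, "wb") as fh:
    pickle.dump(table, fh)
```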

2. Create Benchmark (`2_create_benchmark.py`)

  • Sample diverse datapoints across complexity dimensions
  • Generate questions using natural language templates
  • Create single/multi count, index, and constraint generation tasks
  • Export to JSON and HuggingFace dataset formats
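Template-based question generation can be sketched as follows. The template wording and record fields here are hypothetical, not the benchmark's actual schema:

```python
import json

# Hypothetical template; the benchmark's real phrasing may differ.
COUNT_TEMPLATE = "How many {feature} does the molecule with SMILES {smiles} contain?"

def make_count_question(smiles, feature, answer):
    # Illustrative record shape; the exported JSON uses the project's own schema.
    return {"task": "count",
            "question": COUNT_TEMPLATE.format(feature=feature, smiles=smiles),
            "answer": answer}

record = make_count_question("c1ccccc1O", "aromatic rings", 1)
serialized = json.dumps(record)  # e.g. one record per datapoint in a JSON export
```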

🚀 Getting Started

Prerequisites

# Install moleculariq-core
pip install git+https://github.com/ml-jku/moleculariq-core.git

# Then install this package
pip install .  
# or pip install -e . for development

Quick Start

1. Download PubChem SDF files (optional - pseudo SDF included for testing)

# Download from https://pubchem.ncbi.nlm.nih.gov/docs/downloads
# Place in data/dataset_pools/pubchem_raw_sdf/

2. Run the data creation pipeline

# Stage A: Create molecule pools (run from repo root)
python src/a_dataset_pools/1_collect_pubchem_data.py
python src/a_dataset_pools/2_collect_external_test_set_molecules.py
python src/a_dataset_pools/3_standardize_pubchem_mols_and_remove_external_test_mols.py
python src/a_dataset_pools/4_create_train_test_pools.py
python src/a_dataset_pools/5_create_hard_test_pool_dataframe.py

# Stage B: Generate benchmark (run from repo root)
python src/b_benchmark/1_compute_properties.py
python src/b_benchmark/2_create_benchmark.py

3. Explore the created data

jupyter notebook notebooks/overview_created_data.ipynb

👨‍👧‍👦 MolecularIQ Family

This package is part of the MolecularIQ ecosystem:

| Repository | Purpose |
| --- | --- |
| moleculariq | Central hub for the MolecularIQ benchmark ecosystem |
| moleculariq-leaderboard | Leaderboard: HuggingFace Space that displays results and handles submissions |
| moleculariq-core | Shared library providing core functionality, e.g. symbolic verifiers and question formatting |
| 📍 moleculariq-benchmark | Dataset creation: task definitions, symbolic verifier implementations, question generator |
| moleculariq-eval | Evaluation code: lm-eval-harness integration, model configs, reward functions, extraction functions, and system prompts |

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

