A comprehensive benchmark for evaluating large language models on molecular reasoning tasks
Count, index, and constraint generation questions across diverse chemical features
MolecularIQ is a benchmark specifically designed to measure the structural reasoning abilities of large language models on molecules. Unlike many chemistry evaluation sets that rely on literature labels or surrogate predictors, MolecularIQ focuses only on tasks whose correctness can be verified algorithmically from the molecular graph itself. This makes it possible to distinguish genuine structural understanding from memorization or correlation-based answers.
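Because every answer is derivable from the molecular graph, a grader can recompute the ground truth on the fly instead of trusting stored labels. A minimal sketch of such a check using RDKit directly (the benchmark's actual verifiers live in moleculariq-core; the function names here are illustrative):

```python
from rdkit import Chem

def count_atoms(smiles: str, symbol: str) -> int:
    """Ground-truth count of atoms with a given element symbol."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Unparseable SMILES: {smiles}")
    return sum(1 for atom in mol.GetAtoms() if atom.GetSymbol() == symbol)

def verify_count_answer(smiles: str, symbol: str, model_answer: int) -> bool:
    """Check a model's count answer algorithmically -- no reference labels needed."""
    return model_answer == count_atoms(smiles, symbol)

# Caffeine contains 4 nitrogen atoms
verify_count_answer("CN1C=NC2=C1C(=O)N(C(=O)N2C)C", "N", 4)
```

Any answer a model gives can be scored this way, which is what separates structural reasoning from pattern-matching on memorized labels.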
moleculariq-benchmark/
├── 📂 src/ # Source code
│ ├── 📂 a_dataset_pools/ # Stage A: Dataset pool creation
│ │ ├── 1_collect_pubchem_data.py
│ │ ├── 2_collect_external_test_set_molecules.py
│ │ ├── 3_standardize_pubchem_mols_and_remove_external_test_mols.py
│ │ ├── 4_create_train_test_pools.py
│ │ ├── 5_create_hard_test_pool_dataframe.py
│ │ └── utils/ # External test set utilities
│ └── 📂 b_benchmark/ # Stage B: Benchmark generation
│ ├── 1_compute_properties.py # Compute ground truth properties
│ ├── 2_create_benchmark.py # Generate final benchmark dataset
│ ├── task_names.py # Task name definitions
│ └── benchmark_generator/ # Generation logic (uses moleculariq-core)
│ ├── main.py # CLI entry point
│ ├── config.py # Configuration
│ ├── tasks/ # Task generators (count, index, constraint)
│ ├── core/ # Sampling, scoring, validation
│ └── output/ # JSON & HuggingFace export
├── 📂 data/ # Data artifacts (not tracked)
│ ├── dataset_pools/ # Molecule pools
│ │ ├── external/ # External benchmark molecules
│ │ ├── intermediate/ # Pipeline intermediates
│ │ ├── processed/ # Processed datasets
│ │ ├── pseudo_sdf/ # Sample SDF for testing
│ │ └── pubchem_raw_sdf/ # Raw PubChem SDF files
│ └── benchmark/ # Generated benchmark data
│ ├── properties.pkl # Precomputed molecular properties
│ └── benchmark_dataset.json # Final benchmark dataset
├── 📓 notebooks/ # Analysis notebooks
│ └── overview_created_data.ipynb # Data creation walkthrough
└── 📊 assets/ # Documentation assets
└── moleculariq_statistics.png
1. Collect PubChem Data → 1_collect_pubchem_data.py
- Extract SMILES and IUPAC names from PubChem SDF files
- Filter molecules (carbon-containing, single-fragment)
2. Collect External Test Sets → 2_collect_external_test_set_molecules.py
- Aggregate molecules from LLaSMol, ChemDFM, Ether0, ChemIQ benchmarks
3. Standardize and Filter → 3_standardize_pubchem_mols_and_remove_external_test_mols.py
- Canonicalize SMILES
- Remove molecules present in external benchmarks
4. Create Train/Test Pools → 4_create_train_test_pools.py
- Cluster molecules using MinHash LSH on Morgan fingerprints
- Split into: Training pool, Easy test set, Hard test set
5. Create Hard Test Pool DataFrame → 5_create_hard_test_pool_dataframe.py
- Build structured dataframe with molecular complexity metrics
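The standardization and filtering logic of steps 1 and 3 above can be sketched with RDKit. This is a simplified illustration, not the scripts' actual implementation; the helper names are assumptions:

```python
from typing import Optional
from rdkit import Chem

def standardize(smiles: str) -> Optional[str]:
    """Canonicalize a SMILES string; return None for molecules the pipeline filters out."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None  # unparseable SMILES
    # Keep only single-fragment molecules (no salts or mixtures)
    if len(Chem.GetMolFrags(mol)) != 1:
        return None
    # Keep only carbon-containing molecules
    if not any(atom.GetAtomicNum() == 6 for atom in mol.GetAtoms()):
        return None
    return Chem.MolToSmiles(mol)  # RDKit canonical form

def remove_external(pool: set, external: set) -> set:
    """Drop pool molecules that also appear in external benchmarks (canonical-SMILES match)."""
    return pool - external
```

Canonicalizing both pools before the set difference is what makes the external-molecule removal reliable: two different SMILES spellings of the same molecule map to one canonical string.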
1. Compute Properties → 1_compute_properties.py
   - Calculate ground truth values for all molecular properties
   - Uses `SymbolicSolver` from moleculariq-core for accurate computation
2. Create Benchmark → 2_create_benchmark.py
- Sample diverse datapoints across complexity dimensions
- Generate questions using natural language templates
- Create single/multi count, index, and constraint generation tasks
- Export to JSON and HuggingFace dataset formats
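A hedged sketch of how a count-task datapoint might be assembled: a natural language template is filled with the molecule, and the answer is computed algorithmically. The template text and field names are illustrative, and the real pipeline uses `SymbolicSolver` from moleculariq-core rather than raw RDKit:

```python
from rdkit import Chem

# Hypothetical templates -- the benchmark's actual wording differs
TEMPLATES = {
    "ring_count": "How many rings does the molecule with SMILES {smiles} contain?",
    "atom_count": "How many {symbol} atoms are in the molecule with SMILES {smiles}?",
}

def make_count_task(smiles: str, task: str, **kw) -> dict:
    """Build one benchmark datapoint: question text plus verifiable ground truth."""
    mol = Chem.MolFromSmiles(smiles)
    if task == "ring_count":
        answer = mol.GetRingInfo().NumRings()
    elif task == "atom_count":
        answer = sum(1 for a in mol.GetAtoms() if a.GetSymbol() == kw["symbol"])
    else:
        raise ValueError(f"Unknown task: {task}")
    return {
        "question": TEMPLATES[task].format(smiles=smiles, **kw),
        "answer": answer,
        "task": task,
        "smiles": smiles,
    }

datapoint = make_count_task("c1ccc2ccccc2c1", "ring_count")  # naphthalene
```

Because the answer field is recomputed from the molecular graph at generation time, every exported datapoint ships with a verifiable ground truth rather than an annotated label.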
# Install moleculariq-core
pip install git+https://github.com/ml-jku/moleculariq-core.git
# Then install this package
pip install .
# or pip install -e . for development

# Download PubChem SDF files from https://pubchem.ncbi.nlm.nih.gov/docs/downloads
# Place them in data/dataset_pools/pubchem_raw_sdf/

# Stage A: Create molecule pools (run from repo root)
python src/a_dataset_pools/1_collect_pubchem_data.py
python src/a_dataset_pools/2_collect_external_test_set_molecules.py
python src/a_dataset_pools/3_standardize_pubchem_mols_and_remove_external_test_mols.py
python src/a_dataset_pools/4_create_train_test_pools.py
python src/a_dataset_pools/5_create_hard_test_pool_dataframe.py
# Stage B: Generate benchmark (run from repo root)
python src/b_benchmark/1_compute_properties.py
python src/b_benchmark/2_create_benchmark.py

# Explore the generated data
jupyter notebook notebooks/overview_created_data.ipynb

This package is part of the MolecularIQ ecosystem:
| Repository | Purpose |
|---|---|
| moleculariq | Central hub for the MolecularIQ benchmark ecosystem |
| moleculariq-leaderboard | Leaderboard: HuggingFace Space that displays results and handles submissions |
| moleculariq-core | Shared library providing core functionality, e.g. symbolic verifiers and question formatting |
| 📍 moleculariq-benchmark | Dataset creation: task definitions, symbolic verifier implementations, question generator |
| moleculariq-eval | Evaluation code: integration with lm-eval-harness, model configs, reward functions, extraction functions, and system prompts |
This project is licensed under the MIT License - see the LICENSE file for details.
