This repository contains our solution for Task C1 – Ontology for Biomedical Investigations (OBI) in the LLMs4OL 2025: Large Language Models for Ontology Learning Challenge.
The goal is to automatically discover taxonomic (is-a) relationships between biomedical investigation types.
Given only a list of ontology types, the system predicts parent → child hierarchical relations and outputs them in the required _pairs.json format.
Our approach is a three-stage hybrid pipeline:
- Sentence Embeddings → semantic representation
- Clustering → local ontology grouping
- Large Language Model (LLM) → hierarchical reasoning
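Clustering restricts candidate pairs to semantically related terms, which reduces noise and keeps the number of LLM calls manageable, while the LLM supplies the direction of each is-a relation.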
- Model: `sentence-transformers/all-MiniLM-L6-v2`
- Converts each ontology type into a dense vector
- Captures semantic similarity between biomedical terms
- Algorithm: MiniBatch K-Means
- Cluster size: approximately 50 terms
- Purpose:
- Reduce search space
- Group semantically related types
- Enable local hierarchy discovery
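As a rough illustration of the sizing above, the number of clusters can be derived from the target of about 50 terms per cluster. The helper name `n_clusters_for` is hypothetical, and whether the notebooks compute the count this way is an assumption; the resulting value would then be passed as `n_clusters` to scikit-learn's `MiniBatchKMeans`.

```python
import math

TARGET_CLUSTER_SIZE = 50  # "approximately 50 terms per cluster" from the list above


def n_clusters_for(n_terms: int, target_size: int = TARGET_CLUSTER_SIZE) -> int:
    """Return how many clusters give roughly `target_size` terms each (hypothetical helper)."""
    if n_terms <= 0:
        raise ValueError("n_terms must be positive")
    return max(1, math.ceil(n_terms / target_size))


# Type counts taken from the dataset table in this README.
print(n_clusters_for(4237))  # train types
print(n_clusters_for(2821))  # test types
```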
- Model: `unsloth/Qwen3-1.7B-unsloth-bnb-4bit`
- Quantized (4-bit) for memory efficiency
- Candidate parent–child pairs generated using Top-K nearest neighbors
- LLM performs Yes/No classification for is-a relations
- Batch inference used for faster processing
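The Top-K candidate step can be sketched in plain Python as follows. This is a minimal illustration under assumptions, not the notebook code: the function names, the value of `k`, and the toy 2-dimensional vectors are all made up, and real embeddings would come from the sentence-embedding model. Each term is paired with its `k` most similar neighbors from the same cluster, and every neighbor is proposed as a candidate parent for the LLM to accept or reject.

```python
import math
from typing import Dict, List, Tuple


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_candidates(embeddings: Dict[str, List[float]], k: int = 5) -> List[Tuple[str, str]]:
    """Pair every term with its k most similar terms from the same cluster."""
    pairs = []
    for term, vec in embeddings.items():
        scored = [
            (cosine(vec, other_vec), other)
            for other, other_vec in embeddings.items()
            if other != term
        ]
        scored.sort(key=lambda s: s[0], reverse=True)
        # Each neighbor becomes a (candidate parent, child) pair for the LLM to judge.
        pairs.extend((other, term) for _, other in scored[:k])
    return pairs


# Toy cluster with made-up vectors, for illustration only.
cluster = {
    "distillation": [1.0, 0.0],
    "simple distillation": [0.9, 0.1],
    "assay": [0.0, 1.0],
}
candidates = top_k_candidates(cluster, k=1)
print(candidates)
```

Because the neighbor search is symmetric, both orderings of a close pair tend to appear among the candidates; deciding which direction (if any) is a valid is-a relation is left to the LLM.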
File: cosine-similarity.ipynb
Pipeline:
- Sentence-BERT embeddings
- Pairwise cosine similarity
- Threshold = 0.95 → predict is-a
Limitation: captures similarity but cannot model hierarchy direction
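A small sketch of this baseline makes the limitation concrete. The vectors are toy values rather than real Sentence-BERT embeddings, the function names are hypothetical, and treating the threshold as inclusive (`>=`) is an assumption.

```python
import math
from itertools import combinations

THRESHOLD = 0.95  # threshold stated above


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def baseline_pairs(embeddings, threshold=THRESHOLD):
    """Return unordered term pairs whose cosine similarity meets the threshold."""
    return [
        (a, b)
        for a, b in combinations(sorted(embeddings), 2)
        if cosine(embeddings[a], embeddings[b]) >= threshold
    ]


# Toy vectors, for illustration only.
toy = {
    "distillation": [1.0, 0.0],
    "simple distillation": [0.9, 0.1],
    "assay": [0.0, 1.0],
}
pairs = baseline_pairs(toy)
print(pairs)
```

Since `cosine(a, b)` equals `cosine(b, a)`, the score can flag that two terms are related but carries no information about which one is the parent.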
File: kmeansclustuering-llm.ipynb
Pipeline:
- Embeddings → clustering
- Top-K candidate pairs
- Few-shot prompting with biomedical examples
- LLM classifies is-a relations
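One way to assemble a few-shot Yes/No prompt is sketched below. The instruction wording, the `build_prompt` helper, the example pairs, and the query term "binding assay" are all hypothetical; the actual prompt and biomedical examples used in the notebook are not reproduced here.

```python
from typing import List, Tuple

# Hypothetical few-shot examples as (parent, child, answer) triples.
FEW_SHOT: List[Tuple[str, str, str]] = [
    ("distillation", "simple distillation", "Yes"),
    ("simple distillation", "distillation", "No"),
]


def build_prompt(parent: str, child: str, examples: List[Tuple[str, str, str]] = FEW_SHOT) -> str:
    """Build a few-shot prompt that ends where the model should answer Yes or No."""
    lines = ["Answer Yes or No: is the child term a subtype (is-a) of the parent term?", ""]
    for ex_parent, ex_child, answer in examples:
        lines.append(f"Parent: {ex_parent}\nChild: {ex_child}\nAnswer: {answer}\n")
    lines.append(f"Parent: {parent}\nChild: {child}\nAnswer:")
    return "\n".join(lines)


prompt = build_prompt("assay", "binding assay")
print(prompt)
```

Including one example in each direction is a deliberate choice in this sketch: it shows the model that the order of the two terms matters.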
File: sentence-embeddings-llm.ipynb
Pipeline:
- Embeddings
- MiniBatch K-Means clustering
- Nearest neighbor candidate generation
- Zero-shot LLM classification
- JSON output in required format
This is the final and most scalable system.
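The last two steps, turning Yes/No replies into directed pairs, might look like the sketch below. It assumes the model's reply starts with "Yes" or "No"; the helper names and the sample replies are made up, and real model output may need extra cleanup before this check.

```python
from typing import Dict, List, Tuple


def is_yes(generated: str) -> bool:
    """Interpret a reply as Yes/No from its first word (assumed reply format)."""
    words = generated.strip().split()
    return bool(words) and words[0].strip(".,:!\"'").lower() == "yes"


def keep_confirmed(candidates: List[Tuple[str, str]], replies: List[str]) -> List[Dict[str, str]]:
    """Keep only the (parent, child) candidates that the model confirmed."""
    return [
        {"parent": parent, "child": child}
        for (parent, child), reply in zip(candidates, replies)
        if is_yes(reply)
    ]


candidates = [
    ("distillation", "simple distillation"),
    ("simple distillation", "distillation"),
]
replies = ["Yes.", "No, the relation is reversed."]  # made-up replies for illustration
pairs = keep_confirmed(candidates, replies)
print(pairs)
```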
    .
    ├── cosine-similarity.ipynb
    ├── kmeansclustuering-llm.ipynb
    ├── sentence-embeddings-llm.ipynb
    ├── methodology.png
    └── README.md
The dataset is publicly available from the official challenge repository:
🔗 https://github.com/sciknoworg/LLMs4OL-Challenge/tree/main/2025/TaskC-TaxonomyDiscovery/OBI
| File | Split | Description |
|---|---|---|
| obitrainpairs.json | Train | 8,249 labeled is-a pairs |
| obitraintypes.txt | Train | 4,237 unique ontology types |
| obitesttypes.txt | Test | 2,821 ontology types |
Do not re-upload the data to this repo. Download directly from the source above.
- Clone the official challenge repo or download the OBI files directly
- On Kaggle: upload the files as a dataset named `data-ontology`
- Open any notebook and run all cells
- Output saved to `outputs/submission_pairs.json` in the form `[ {"parent": "distillation", "child": "simple distillation"} ]`
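For reference, writing predictions in this shape needs only the standard library. The sketch below is an assumption about how the file could be produced, not the notebook code: `save_pairs` is a hypothetical helper, the duplicate removal is an addition of this sketch, and a temporary directory stands in for the real `outputs/` folder.

```python
import json
import os
import tempfile


def save_pairs(pairs, path):
    """Write pairs as a JSON list of {"parent", "child"} objects, dropping duplicates."""
    seen = set()
    unique = []
    for pair in pairs:
        key = (pair["parent"], pair["child"])
        if key not in seen:
            seen.add(key)
            unique.append({"parent": key[0], "child": key[1]})
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(unique, f, indent=2, ensure_ascii=False)
    return unique


# A temporary directory keeps this sketch from writing into the repository.
out_path = os.path.join(tempfile.mkdtemp(), "outputs", "submission_pairs.json")
predictions = [
    {"parent": "distillation", "child": "simple distillation"},
    {"parent": "distillation", "child": "simple distillation"},  # duplicate on purpose
]
saved = save_pairs(predictions, out_path)

# Read the file back to confirm it is valid JSON in the expected shape.
with open(out_path, encoding="utf-8") as f:
    loaded = json.load(f)
print(loaded)
```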
