🧬 LLMs4OL 2025 – Task C1 (OBI)

Hybrid Embedding–Clustering–LLM Pipeline for Taxonomy Discovery

This repository contains our solution for Task C1 – Ontology for Biomedical Investigations (OBI) in the LLMs4OL 2025: Large Language Models for Ontology Learning Challenge.

The goal is to automatically discover taxonomic (is-a) relationships between biomedical investigation types.

Given only a list of ontology types, the system predicts parent → child hierarchical relations and outputs them in the required _pairs.json format.

🧠 Methodology Overview

Our approach is a three-stage hybrid pipeline:

Sentence Embeddings → semantic representation
Clustering → local ontology grouping
Large Language Model (LLM) → hierarchical reasoning

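Clustering restricts candidate pairs to semantically related terms, which filters out unrelated pairs and keeps the number of LLM calls manageable. The LLM then judges each ordered pair, so the system can predict the direction of an is-a relation, which similarity scores alone cannot do.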

🖼️ Methodology Diagram

The pipeline diagram is stored as methodology.png in the repository root.

⚙️ Pipeline

Stage 1 — Sentence Embeddings

Model: sentence-transformers/all-MiniLM-L6-v2
Converts each ontology type into a dense vector
Captures semantic similarity between biomedical terms
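A minimal sketch of this stage, assuming the sentence-transformers package is installed. The helper names load_encoder and embed_terms are illustrative and do not come from the notebooks. Normalizing the vectors is also an assumption; it makes later cosine similarity a plain dot product.

```python
from typing import Any, List, Sequence, Tuple

MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"  # model named in this README


def load_encoder(model_name: str = MODEL_NAME) -> Any:
    """Load the Sentence-Transformers model (downloads weights on first use)."""
    # Imported lazily so embed_terms can be used without the package installed.
    from sentence_transformers import SentenceTransformer

    return SentenceTransformer(model_name)


def embed_terms(terms: Sequence[str], encoder: Any) -> Tuple[List[str], Any]:
    """Return the cleaned terms and one dense vector per term.

    `encoder` is any object with a Sentence-Transformers style `encode` method.
    """
    # Drop blank lines and duplicates while keeping the original order.
    cleaned = list(dict.fromkeys(t.strip() for t in terms if t.strip()))
    # normalize_embeddings=True yields unit-length vectors (an assumption here).
    vectors = encoder.encode(cleaned, normalize_embeddings=True)
    return cleaned, vectors
```

Typical use would be `terms, vectors = embed_terms(lines, load_encoder())`, where `lines` holds the contents of the types file. With all-MiniLM-L6-v2, each vector has 384 dimensions.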

Stage 2 — Clustering

Algorithm: MiniBatch K-Means
Target cluster size: ≈ 50 terms
Purpose:
- Reduce search space
- Group semantically related types
- Enable local hierarchy discovery
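To make the search-space reduction concrete, the arithmetic below uses the 2,821 test types and the target of about 50 terms per cluster. It is a back-of-the-envelope sketch: the function names are illustrative, and the even split across clusters is an idealization, since K-Means does not guarantee equal cluster sizes.

```python
import math


def n_clusters_for(n_terms: int, target_size: int = 50) -> int:
    """Clusters needed so the average cluster holds about target_size terms."""
    return max(1, math.ceil(n_terms / target_size))


def ordered_pairs(n: int) -> int:
    """Number of ordered (parent, child) pairs among n distinct terms."""
    return n * (n - 1)


def within_cluster_pairs(n_terms: int, n_clusters: int) -> int:
    """Ordered pairs when terms are split as evenly as possible (idealized)."""
    base, extra = divmod(n_terms, n_clusters)
    # `extra` clusters hold base + 1 terms, the remaining ones hold base terms.
    return extra * ordered_pairs(base + 1) + (n_clusters - extra) * ordered_pairs(base)


n_test = 2821                            # size of the OBI test type list
k = n_clusters_for(n_test)               # 57 clusters
print(ordered_pairs(n_test))             # 7955220 ordered pairs without clustering
print(within_cluster_pairs(n_test, k))   # 136808 ordered pairs inside clusters
```

The value returned by n_clusters_for is what would be passed as `n_clusters` to scikit-learn's MiniBatchKMeans. Under these assumptions, clustering alone shrinks the pair space by a factor of roughly 58 before any nearest-neighbor filtering.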

Stage 3 — LLM-based is-a Prediction

Model: unsloth/Qwen3-1.7B-unsloth-bnb-4bit
Quantized (4-bit) for memory efficiency
Candidate parent–child pairs generated using Top-K nearest neighbors
LLM performs Yes/No classification for is-a relations
Batch inference used for faster processing
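The candidate generation and batching steps can be sketched in plain Python as follows. This is an illustration under assumptions, not the notebook code: the function names, the default k, and the choice to treat each neighbor as a candidate parent of the query term are all hypothetical, and the toy vectors stand in for real embeddings.

```python
import math
from typing import Iterator, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_candidates(
    terms: Sequence[str],
    vectors: Sequence[Sequence[float]],
    labels: Sequence[int],
    k: int = 5,
) -> List[Tuple[str, str]]:
    """Pair each term with its k most similar terms from the same cluster.

    Each tuple is an ordered (parent, child) question for the LLM.
    """
    candidates = []
    for i, term in enumerate(terms):
        # Only terms that share the cluster label are considered.
        same = [j for j in range(len(terms)) if j != i and labels[j] == labels[i]]
        same.sort(key=lambda j: cosine(vectors[i], vectors[j]), reverse=True)
        for j in same[:k]:
            candidates.append((terms[j], term))  # is `term` a kind of terms[j]?
    return candidates


def batches(items: Sequence, size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of at most `size` items for batch inference."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Toy data: two clusters of two terms each.
terms = ["distillation", "simple distillation", "assay", "binding assay"]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [0, 0, 1, 1]
for batch in batches(top_k_candidates(terms, vectors, labels, k=1), size=2):
    print(batch)
```

Because nearest-neighbor lists are not symmetric, a pair may be proposed in one direction only or in both; the LLM's Yes/No answer decides which direction, if any, is kept.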

🧪 Implemented Approaches

1️⃣ Cosine Similarity Baseline

File: cosine-similarity.ipynb

Pipeline:

Sentence-BERT embeddings
Pairwise cosine similarity
Threshold = 0.95 → predict is-a

Limitation: captures similarity but cannot model hierarchy direction
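A small sketch of the baseline shows why it cannot recover direction. The vectors are toy values, the function names are illustrative, and treating the threshold as inclusive (>=) is an assumption.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def predict_similar_pairs(
    terms: Sequence[str],
    vectors: Sequence[Sequence[float]],
    threshold: float = 0.95,
) -> List[Tuple[str, str]]:
    """Return unordered term pairs whose similarity reaches the threshold."""
    pairs = []
    for i, j in combinations(range(len(terms)), 2):
        if cosine_similarity(vectors[i], vectors[j]) >= threshold:
            pairs.append((terms[i], terms[j]))
    return pairs


terms = ["distillation", "simple distillation", "assay"]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # toy embeddings
print(predict_similar_pairs(terms, vectors))
# The score is symmetric, so it says nothing about which term is the parent.
print(cosine_similarity(vectors[0], vectors[1]) == cosine_similarity(vectors[1], vectors[0]))
```

The pair (distillation, simple distillation) passes the threshold, but the score is identical in both orders, so the baseline has to guess which term is the parent.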

2️⃣ K-Means Clustering + Few-Shot LLM

File: kmeansclustuering-llm.ipynb

Pipeline:

Embeddings → clustering
Top-K candidate pairs
Few-shot prompting with biomedical examples
LLM classifies is-a relations
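A few-shot prompt for this approach might be assembled as below. The wording, the function name, and the example set are assumptions for illustration; only the distillation pair appears in this README, and its reversed form is used here as a negative example.

```python
from typing import Sequence, Tuple

# (parent, child, answer). The first pair is the example from this README;
# the second simply reverses it to illustrate a "No" case.
FEW_SHOT_EXAMPLES = [
    ("distillation", "simple distillation", "Yes"),
    ("simple distillation", "distillation", "No"),
]


def build_few_shot_prompt(
    parent: str,
    child: str,
    examples: Sequence[Tuple[str, str, str]] = FEW_SHOT_EXAMPLES,
) -> str:
    """Build a Yes/No is-a prompt with worked examples before the query pair."""
    lines = [
        "Decide whether the child term is a subtype (is-a) of the parent term.",
        "Answer with Yes or No.",
        "",
    ]
    for ex_parent, ex_child, answer in examples:
        lines += [f"Parent: {ex_parent}", f"Child: {ex_child}", f"Answer: {answer}", ""]
    # The query pair ends with an open "Answer:" for the model to complete.
    lines += [f"Parent: {parent}", f"Child: {child}", "Answer:"]
    return "\n".join(lines)


print(build_few_shot_prompt("assay", "binding assay"))
```

Passing `examples=[]` gives the zero-shot variant of the same prompt.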

3️⃣ Embeddings + Clustering + Zero-Shot LLM (Final)

File: sentence-embeddings-llm.ipynb

Pipeline:

Embeddings
MiniBatch K-Means clustering
Nearest neighbor candidate generation
Zero-shot LLM classification
JSON output in required format

This is the final and most scalable system.
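The last two steps, turning the model's answers into the required output, can be sketched as follows. The function names are hypothetical, and the parser assumes the reply text begins with the answer itself (any reasoning text the model emits before it would need to be stripped first).

```python
import json
import re
from typing import Dict, List, Sequence, Tuple


def parse_yes_no(answer: str) -> bool:
    """Return True only when the reply starts with the word 'yes'."""
    match = re.match(r"\s*(yes|no)\b", answer, flags=re.IGNORECASE)
    return bool(match) and match.group(1).lower() == "yes"


def to_submission(
    candidates: Sequence[Tuple[str, str]],
    answers: Sequence[str],
) -> List[Dict[str, str]]:
    """Keep the (parent, child) candidates the model accepted, without duplicates."""
    seen = set()
    pairs = []
    for (parent, child), answer in zip(candidates, answers):
        if parse_yes_no(answer) and (parent, child) not in seen:
            seen.add((parent, child))
            pairs.append({"parent": parent, "child": child})
    return pairs


candidates = [
    ("distillation", "simple distillation"),
    ("simple distillation", "distillation"),
]
answers = ["Yes", "No, the first term is more specific."]
print(json.dumps(to_submission(candidates, answers)))
# [{"parent": "distillation", "child": "simple distillation"}]
```

Unparseable replies are treated as "No" here, which favors precision; a different default may suit other evaluation settings.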

📂 Repository Structure

.
├── cosine-similarity.ipynb
├── kmeansclustuering-llm.ipynb
├── sentence-embeddings-llm.ipynb
├── methodology.png
└── README.md

Data

The dataset is publicly available from the official challenge repository:

🔗 https://github.com/sciknoworg/LLMs4OL-Challenge/tree/main/2025/TaskC-TaxonomyDiscovery/OBI

| File | Split | Description |
|---|---|---|
| `obitrainpairs.json` | Train | 8,249 labeled is-a pairs |
| `obitraintypes.txt` | Train | 4,237 unique ontology types |
| `obitesttypes.txt` | Test | 2,821 ontology types |

Do not re-upload the data to this repo. Download directly from the source above.

How to Run

1. Clone the official challenge repo or download the OBI files directly.
2. On Kaggle, upload the files as a dataset named data-ontology.
3. Open any notebook and run all cells.
4. The output is saved to outputs/submission_pairs.json, for example:

[ {"parent": "distillation", "child": "simple distillation"} ]
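Before submitting, the output file can be sanity-checked with a few lines of Python. This is a sketch: validate_submission is a hypothetical helper, and the checks reflect only the format shown in this README (a JSON list of objects with non-empty "parent" and "child" strings), not the official validator.

```python
import json


def validate_submission(text: str) -> int:
    """Check the JSON text of a submission and return the number of pairs."""
    data = json.loads(text)
    if not isinstance(data, list):
        raise ValueError("submission must be a JSON list")
    for index, item in enumerate(data):
        if not isinstance(item, dict):
            raise ValueError(f"entry {index} is not an object")
        for key in ("parent", "child"):
            value = item.get(key)
            # Each pair needs a non-empty string for both keys.
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"entry {index} has a missing or empty '{key}'")
    return len(data)


example = '[ {"parent": "distillation", "child": "simple distillation"} ]'
print(validate_submission(example))  # 1
```

To check a real run, read outputs/submission_pairs.json and pass its contents to the function.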
