Harnessing Large Language Models for Scientific Novelty Detection

This repository contains the official implementation of LLM-based Scientific Novelty Detection, a framework that leverages large language models (LLMs) for benchmark construction, idea-level retriever distillation, and retrieval-augmented novelty detection.

Scientific novelty detection aims to determine whether a research idea is conceptually novel with respect to existing literature. Instead of relying only on surface-level textual similarity, this project focuses on idea-level conceptual alignment between research ideas.

Overview

The framework consists of three main components:

1. Benchmark Dataset Construction
   We construct novelty detection datasets with topological closure and compactness. Seed papers are collected from specific research domains, and their references are crawled to form a closed corpus. LLMs are then used to summarize each paper into compact idea descriptions.
2. LLM-based Knowledge Distillation for Idea Retrieval
   We generate synthesized non-novel ideas from anchor ideas using LLMs, including rephrased, partial, and incremental ideas. These anchor-synthesized idea pairs are used to fine-tune a lightweight retriever with contrastive learning, aligning the retriever with idea-level similarity rather than surface textual similarity.
3. RAG-based Novelty Detection
   Given a target research idea, the distilled retriever first retrieves the top-K conceptually related ideas. Then, an LLM cross-checks the target idea against the retrieved candidates and produces novelty scores. A decision tree classifier makes the final Novel / Non-Novel prediction.

Project Structure

.
├── README.md
├── requirements.txt
├── data/
│   ├── raw/
│   │   ├── acl/                         # Raw ACL/NLP-domain data
│   │   └── marketing/                   # Raw Marketing-domain data
│   └── processed/
│       └── nc/
│           ├── acl/                     # Processed ACL/NLP novelty detection data
│           └── marketing/               # Processed Marketing novelty detection data
├── scripts/
│   ├── utils/
│   │   ├── baseline.py                  # Baseline novelty detection methods
│   │   ├── config.json                  # Configuration file
│   │   ├── csv_processing.py            # CSV processing utilities
│   │   ├── data_preprocessing.py        # Data preprocessing script
│   │   ├── dataset.py                   # Dataset loading and processing
│   │   ├── embeddings.py                # Embedding generation utilities
│   │   ├── general.py                   # General helper functions
│   │   ├── llm.py                       # LLM calling and prompting utilities
│   │   ├── paper_search.py              # Paper search / retrieval utilities
│   │   └── pdf_processing.py            # PDF parsing and processing
│   ├── retrieval/
│   │   ├── train.py                     # Train idea-level retriever
│   │   └── test.py                      # Evaluate idea retrieval performance
│   └── nc/
│       ├── classifier.py                # Decision tree classifier for novelty detection
│       ├── deepseek_parallel.py         # Parallel LLM-based novelty scoring
│       └── run_novelty_checking_integrated.py  # Integrated novelty checking pipeline

Requirements

Python 3.8+
PyTorch
transformers
sentence-transformers
scikit-learn
numpy
pandas
tqdm

Install dependencies with:

pip install -r requirements.txt

Datasets

We provide two benchmark datasets for scientific novelty detection:

Dataset	Domain	Seed Papers	Reference Corpus	Description
Marketing	Social Science / Marketing	470	12,577 unique papers	Papers collected from Journal of Marketing and Journal of Marketing Research
NLP	Natural Language Processing	3,533	32,239 unique papers	Papers collected from recent ACL conferences

The datasets are designed with:

Topological closure: reference papers of seed papers are included to approximate the prior literature used for novelty judgment.
Compactness: each paper is represented by an LLM-generated idea summary, including its core contribution, hypothesis, and methodology.
Synthesized non-novel ideas: LLM-generated rephrased, partial, and incremental ideas are used for retriever training and novelty detection evaluation.
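
The topological-closure property can be illustrated with a small sketch: the reference corpus is the de-duplicated union of the references cited by the seed papers. The `Paper` class, the paper IDs, and the `build_closed_corpus` helper are hypothetical names chosen for illustration and are not taken from the repository's code.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Paper:
    """Hypothetical record: a paper ID plus the IDs of the papers it cites."""
    paper_id: str
    references: List[str] = field(default_factory=list)


def build_closed_corpus(seeds: List[Paper]) -> Set[str]:
    """Return the set of unique reference IDs cited by any seed paper."""
    corpus: Set[str] = set()
    for paper in seeds:
        # A reference shared by several seed papers is stored only once.
        corpus.update(paper.references)
    return corpus


# Toy data (made up): "ref-b" is cited by both seed papers.
seeds = [
    Paper("seed-1", ["ref-a", "ref-b"]),
    Paper("seed-2", ["ref-b", "ref-c"]),
]
corpus = build_closed_corpus(seeds)
```

In this toy case the two seed papers cite four references in total, but the corpus holds three unique papers, which mirrors how the "unique papers" counts in the table above would arise.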

Quick Start

Run the integrated novelty detection pipeline

python scripts/nc/run_novelty_checking_integrated.py \
    --dataset acl \
    --config scripts/utils/config.json

For the Marketing dataset:

python scripts/nc/run_novelty_checking_integrated.py \
    --dataset marketing \
    --config scripts/utils/config.json

Run Step by Step

Step 1: Data preprocessing

python scripts/utils/data_preprocessing.py \
    --dataset acl \
    --input_dir data/raw/acl \
    --output_dir data/processed/nc/acl

For Marketing:

python scripts/utils/data_preprocessing.py \
    --dataset marketing \
    --input_dir data/raw/marketing \
    --output_dir data/processed/nc/marketing

Step 2: Generate / load embeddings

python scripts/utils/embeddings.py \
    --dataset acl \
    --data_dir data/processed/nc/acl

Step 3: Train the idea-level retriever

python scripts/retrieval/train.py \
    --dataset acl \
    --data_dir data/processed/nc/acl

Step 4: Evaluate idea retrieval

python scripts/retrieval/test.py \
    --dataset acl \
    --data_dir data/processed/nc/acl

Step 5: Run LLM-based novelty checking

python scripts/nc/deepseek_parallel.py \
    --dataset acl \
    --data_dir data/processed/nc/acl

Step 6: Train / evaluate the novelty classifier

python scripts/nc/classifier.py \
    --dataset acl \
    --data_dir data/processed/nc/acl
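
The six steps can also be chained from a small driver script. The sketch below only assembles the commands listed above and runs them in order; the `build_commands` and `run_pipeline` helpers are hypothetical and not part of the repository. Steps 2 to 6 are shown above only for `acl`, so applying the same flags to `marketing` is an assumption made by analogy with Step 1.

```python
import subprocess
import sys
from typing import List

# Script paths copied from Steps 2-6 above, in execution order.
DATA_DIR_STEPS = [
    "scripts/utils/embeddings.py",
    "scripts/retrieval/train.py",
    "scripts/retrieval/test.py",
    "scripts/nc/deepseek_parallel.py",
    "scripts/nc/classifier.py",
]


def build_commands(dataset: str) -> List[List[str]]:
    """Assemble the argument lists for all six steps of one dataset."""
    if dataset not in ("acl", "marketing"):
        raise ValueError(f"unknown dataset: {dataset!r}")
    data_dir = f"data/processed/nc/{dataset}"
    # Step 1 takes separate input and output directories.
    commands = [[
        sys.executable, "scripts/utils/data_preprocessing.py",
        "--dataset", dataset,
        "--input_dir", f"data/raw/{dataset}",
        "--output_dir", data_dir,
    ]]
    # Steps 2-6 share the same two flags.
    for script in DATA_DIR_STEPS:
        commands.append(
            [sys.executable, script, "--dataset", dataset, "--data_dir", data_dir]
        )
    return commands


def run_pipeline(dataset: str) -> None:
    """Run each step in order and stop at the first failing step."""
    for command in build_commands(dataset):
        subprocess.run(command, check=True)
```

Calling `run_pipeline("acl")` from the repository root would execute the steps sequentially; `check=True` makes a failed step raise instead of silently continuing.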

Key Hyperparameters

Parameter	Default	Description
`dataset`	`acl`	Dataset name, selected from `acl` and `marketing`
`retriever`	`bge`	Retriever backbone for idea retrieval
`learning_rate`	`2e-5`	Learning rate for retriever fine-tuning
`batch_size`	`16`	Batch size for contrastive retriever training
`temperature`	`0.05`	Temperature parameter in contrastive learning
`top_k`	`5 / 10`	Number of retrieved ideas for RAG-based novelty detection
`llm`	`deepseek-reasoner`	LLM backbone for novelty checking
`classifier`	`decision_tree`	Classifier used for the final Novel / Non-Novel decision
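
As a rough illustration of how these defaults could be combined with user overrides, the sketch below merges a JSON string into a defaults dictionary. The actual schema of `scripts/utils/config.json` is not shown in this README, so the flat key layout and the `load_config` helper are assumptions. The table lists `top_k` as `5 / 10`; the value 5 is used here purely for illustration.

```python
import json
from typing import Any, Dict

# Defaults taken from the table above (flat layout is an assumption).
DEFAULTS: Dict[str, Any] = {
    "dataset": "acl",
    "retriever": "bge",
    "learning_rate": 2e-5,
    "batch_size": 16,
    "temperature": 0.05,
    "top_k": 5,  # table says "5 / 10"; 5 chosen only for this sketch
    "llm": "deepseek-reasoner",
    "classifier": "decision_tree",
}


def load_config(json_text: str) -> Dict[str, Any]:
    """Merge JSON overrides into the defaults and run basic sanity checks."""
    overrides = json.loads(json_text)
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise KeyError(f"unknown hyperparameters: {sorted(unknown)}")
    config = dict(DEFAULTS)
    config.update(overrides)
    if config["dataset"] not in ("acl", "marketing"):
        raise ValueError("dataset must be 'acl' or 'marketing'")
    if config["temperature"] <= 0:
        raise ValueError("temperature must be positive")
    return config


config = load_config('{"dataset": "marketing", "top_k": 10}')
```

Rejecting unknown keys catches typos such as `topk` early, before a long training or LLM scoring run starts.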

Method Details

LLM-based KD Retriever

Given an anchor idea and its LLM-synthesized non-novel variant, the retriever is trained to pull conceptually similar ideas closer while pushing unrelated ideas away.

The training objective follows a contrastive learning formulation:

$$
L = -\log \frac{\exp\left(\mathrm{sim}(f(s_i), f(g_i)) / \tau\right)}{\sum_j \exp\left(\mathrm{sim}(f(s_j), f(g_i)) / \tau\right)}
$$

where:

s_i is an anchor idea from the novelty corpus,
g_i is a synthesized idea generated by an LLM,
f(.) is the retriever encoder,
sim(.) is cosine similarity,
τ is the temperature.
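
To make the objective concrete, here is a minimal, dependency-free sketch that evaluates it on already-encoded vectors, so the inputs stand for f(s_i) and f(g_i). It is not the repository's training code, which would presumably use PyTorch. Averaging the per-pair loss over the batch and the toy vectors are assumptions, as are the `cosine` and `contrastive_loss` helper names.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(
    anchors: List[Vector], synthesized: List[Vector], tau: float = 0.05
) -> float:
    """Mean over i of -log(exp(sim(s_i, g_i)/tau) / sum_j exp(sim(s_j, g_i)/tau))."""
    if len(anchors) != len(synthesized) or not anchors:
        raise ValueError("need equally sized, non-empty batches")
    total = 0.0
    for i, g in enumerate(synthesized):
        # Similarity of g_i to every anchor s_j in the batch, scaled by tau.
        logits = [cosine(s, g) / tau for s in anchors]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(synthesized)


# Toy embeddings (made up): each synthesized idea is close to its own anchor.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]
swapped = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = contrastive_loss(anchors, aligned)
loss_swapped = contrastive_loss(anchors, swapped)
```

With these toy vectors, `loss_aligned` is close to zero while `loss_swapped` is large, which is the behavior the objective is meant to encourage: a synthesized idea should be most similar to the anchor it was generated from.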

RAG-based Novelty Detection

For each target idea, the trained retriever retrieves the top-K most conceptually related ideas. An LLM then compares the target idea against these retrieved candidates and outputs novelty scores. These scores are fed into a supervised decision tree classifier for the final novelty prediction.
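
The retrieval step can be sketched as a cosine-similarity ranking over pre-computed idea embeddings. The snippet below is a simplified stand-in: the `retrieve_top_k` helper, the idea IDs, and the two-dimensional vectors are hypothetical, and a real run would use embeddings produced by the trained retriever.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve_top_k(
    target: Vector, corpus: Dict[str, Vector], k: int
) -> List[Tuple[str, float]]:
    """Return the k corpus ideas most similar to the target, best first."""
    if k <= 0:
        raise ValueError("k must be positive")
    scored = [(idea_id, cosine(target, emb)) for idea_id, emb in corpus.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy corpus (made up): three idea embeddings keyed by idea ID.
corpus = {
    "idea-a": [1.0, 0.0],
    "idea-b": [0.8, 0.6],
    "idea-c": [0.0, 1.0],
}
target = [1.0, 0.1]
top_ideas = retrieve_top_k(target, corpus, k=2)
```

The retrieved IDs would then be passed, together with the target idea, to the LLM for cross-checking, and the resulting novelty scores would serve as input features for the decision tree classifier.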

Results

The proposed LLM-KD retriever consistently improves idea retrieval performance across both Marketing and NLP datasets. In the novelty detection task, the RAG-KD method achieves the best overall performance compared with heuristic novelty metrics, LLM-only baselines, and RAG with vanilla retrievers.

Citation

If you find this repository useful, please cite our paper:

@inproceedings{liu2026scientificnovelty,
  title={Harnessing Large Language Models for Scientific Novelty Detection},
  author={Liu, Yan and Yang, Zonglin and Poria, Soujanya and Nguyen, Thanh-Son and Cambria, Erik},
  booktitle={International Conference on Artificial Neural Networks},
  year={2026}
}
