Scientific Paper Q&A using BM25 RAG

A retrieval-augmented generation (RAG) system for querying ML/AI research papers using BM25 sparse retrieval, with no vector embeddings and no hosted LLM API. Users ask natural-language questions and receive grounded answers with citations to the source papers.

Overview

Most RAG systems rely on vector embeddings and similarity search. This project demonstrates that strong retrieval can be achieved with classical BM25 (Best Match 25), a keyword-based ranking algorithm widely used in information retrieval. Retrieved abstracts are passed as context to a locally running LLM, which synthesizes a cited answer.
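For readers unfamiliar with BM25, the scoring idea can be sketched in plain Python. This is a minimal illustration, not the project's retriever.py (which uses rank_bm25). The whitespace tokenizer, the k1 = 1.5 and b = 0.75 defaults, the smoothed IDF variant, and the helper names are all assumptions made for the example, so the exact scores may differ from what rank_bm25 produces.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (an assumed, deliberately naive tokenizer)."""
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one BM25 score per document for the given query."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Smoothed IDF variant that stays positive even for very common terms.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates via k1; b controls length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def top_k(query, docs, k=3):
    """Return (document index, score) pairs for the k best-scoring documents."""
    scores = bm25_scores(query, docs)
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in ranked[:k]]
```

Calling top_k("image segmentation", abstracts, k=5) on a list of abstract strings would return the indices of the five abstracts that share the most informative terms with the query. Documents with no query terms score zero.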

Architecture

User Question
      ↓
BM25 Retriever (rank_bm25)
      ↓
Top-K Relevant Abstracts
      ↓
LLM Context Window (Ollama — qwen2.5)
      ↓
Grounded Answer + Citations
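
The flow above could be wired together roughly as follows. This is a sketch under assumptions rather than the project's code: it assumes Ollama's default local endpoint (http://localhost:11434) and its /api/generate route, and the function names (build_payload, generate, answer), the retrieve callback, and the paper fields title and abstract are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_payload(prompt, model="qwen2.5"):
    """Build the JSON body for a non-streaming generate request."""
    return {"model": model, "prompt": prompt, "stream": False}


def generate(prompt, model="qwen2.5", url=OLLAMA_URL, timeout=120):
    """Send the prompt to a running Ollama server and return the generated text."""
    body = json.dumps(build_payload(prompt, model)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]


def answer(question, retrieve, llm=generate, k=5):
    """Retrieve the top-k abstracts, pack them into a prompt, and ask the LLM.

    `retrieve` is any callable (question, k) -> list of paper dicts.
    """
    papers = retrieve(question, k)
    context = "\n\n".join(
        f"[{i}] {p['title']}\n{p['abstract']}" for i, p in enumerate(papers, 1)
    )
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer with citations like [1]."
    return llm(prompt), papers
```

Passing the retriever and the LLM as callables keeps each stage swappable and lets the glue code be exercised without a running Ollama server.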

Project Structure

arxiv-rag/
├── data/
│   └── fetch_papers.py     # Fetches ArXiv abstracts via API
├── rag/
│   ├── retriever.py        # BM25 index and retrieval logic
│   └── generator.py        # LLM answer synthesis with citations
├── ui/
│   └── app.py              # Streamlit chat interface
├── requirements.txt
└── README.md

Features

Vectorless retrieval — BM25 ranking with no embeddings or vector database
Grounded answers — LLM is instructed to cite only the retrieved papers
Source transparency — every retrieved paper is shown with title, authors, BM25 score, and abstract preview
Adjustable top-k — slider to control how many papers are retrieved per query
Fully local — runs on Ollama, no external API required
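
The grounding and source-transparency features can be illustrated with a short sketch. Everything here is an assumption for illustration: the instruction wording, the [n] citation format, the paper fields (title, authors, abstract), and the helper names are hypothetical and not taken from generator.py or app.py.

```python
import re

# Hypothetical instruction text; the project's actual prompt wording is not shown here.
GROUNDING_RULES = (
    "Answer using only the numbered papers provided. "
    "Cite them as [1], [2], ... and do not cite anything else. "
    "If the papers do not contain the answer, say so."
)


def format_source(index, paper, score, preview_chars=200):
    """Render one retrieved paper the way a sources panel might show it."""
    authors = ", ".join(paper["authors"])
    preview = paper["abstract"][:preview_chars]
    return f"[{index}] {paper['title']} ({authors}) | BM25 {score:.2f}\n{preview}"


def cited_indices(answer):
    """Return the set of citation numbers such as [2] found in the answer."""
    return {int(n) for n in re.findall(r"\[(\d+)\]", answer)}


def unsupported_citations(answer, n_sources):
    """Return citation numbers that do not match any retrieved paper."""
    return sorted(i for i in cited_indices(answer) if not 1 <= i <= n_sources)
```

A check like unsupported_citations can flag answers that cite a paper number outside the retrieved set, which is one simple way to verify that the model followed the citation instruction.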

Dataset

~500 ArXiv abstracts fetched via the ArXiv API across 8 ML/AI topic areas:

Machine Learning
Deep Learning
Natural Language Processing
Reinforcement Learning
Computer Vision
Large Language Models
Graph Neural Networks
Transformer Architecture
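
A fetch step along these lines could look like the sketch below. It is not the contents of fetch_papers.py: it assumes the public ArXiv API query endpoint and its Atom feed response, an all-fields phrase search per topic, and hypothetical helper names and output fields (id, title, authors, abstract).

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

ARXIV_API = "http://export.arxiv.org/api/query"
ATOM = {"atom": "http://www.w3.org/2005/Atom"}  # Atom XML namespace used by the feed


def build_query_url(topic, max_results=100):
    """Build an ArXiv API URL that searches all fields for the exact topic phrase."""
    params = {"search_query": f'all:"{topic}"', "start": 0, "max_results": max_results}
    return f"{ARXIV_API}?{urllib.parse.urlencode(params)}"


def parse_feed(xml_text):
    """Extract id, title, authors, and abstract from an Atom feed."""
    papers = []
    for entry in ET.fromstring(xml_text).findall("atom:entry", ATOM):
        title = entry.findtext("atom:title", default="", namespaces=ATOM)
        summary = entry.findtext("atom:summary", default="", namespaces=ATOM)
        papers.append({
            "id": entry.findtext("atom:id", default="", namespaces=ATOM).strip(),
            "title": " ".join(title.split()),  # collapse line breaks and extra spaces
            "authors": [
                author.findtext("atom:name", default="", namespaces=ATOM)
                for author in entry.findall("atom:author", ATOM)
            ],
            "abstract": " ".join(summary.split()),
        })
    return papers


def fetch_topic(topic, max_results=100):
    """Download and parse one page of results for a topic (requires network access)."""
    with urllib.request.urlopen(build_query_url(topic, max_results), timeout=30) as response:
        return parse_feed(response.read().decode("utf-8"))


def dedupe(papers):
    """Keep the first occurrence of each paper id, since topic results can overlap."""
    seen = {}
    for paper in papers:
        seen.setdefault(paper["id"], paper)
    return list(seen.values())
```

Deduplicating by id matters because a paper on, say, transformers for NLP can be returned by more than one topic query.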

Tech Stack

Layer	Technology
Data	ArXiv API
Retrieval	BM25 (`rank_bm25`)
LLM	qwen2.5 via Ollama (local)
UI	Streamlit
Language	Python

Setup & Run

Prerequisites

Python 3.8+
Ollama installed and running with qwen2.5 pulled

ollama pull qwen2.5

1. Install dependencies

pip install -r requirements.txt

2. Fetch papers

python data/fetch_papers.py

This fetches ~500 unique ArXiv abstracts and saves them to data/papers.json.

3. Start the app

streamlit run ui/app.py

Open http://localhost:8501 in your browser.

Example Questions

"What methods are used for image segmentation?"
"How do transformers work in NLP?"
"What are the latest advances in reinforcement learning?"
"How are graph neural networks used in practice?"
"What are common techniques for training large language models?"
"What are the challenges of deploying machine learning models in production?"

Notes

data/papers.json is not committed — regenerate it with python data/fetch_papers.py
BM25 index is built in-memory at startup (~1 second for 500 papers)
Ollama must be running before you start the app (run ollama serve if it does not start automatically)
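
A quick pre-flight check for the last note can be sketched as follows. It assumes Ollama's default address (http://localhost:11434) and its /api/tags route for listing local models; the helper names are hypothetical and this check is not part of the project.

```python
import json
import urllib.error
import urllib.request


def model_available(tags, model="qwen2.5"):
    """Check a parsed /api/tags response for a model name, ignoring the ':tag' suffix."""
    names = [entry.get("name", "") for entry in tags.get("models", [])]
    return any(name.split(":")[0] == model for name in names)


def ollama_ready(base_url="http://localhost:11434", model="qwen2.5", timeout=3):
    """Return True if the server answers and lists the model, False otherwise."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as response:
            return model_available(json.loads(response.read()), model)
    except (urllib.error.URLError, OSError, ValueError):
        # Server unreachable, or the response was not valid JSON.
        return False
```

Running ollama_ready() before launching Streamlit gives a clear yes/no answer instead of a connection error on the first question.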
