alim1496/arxiv-rag

Scientific Paper Q&A using BM25 RAG

A retrieval-augmented generation (RAG) system for querying ML/AI research papers using BM25 sparse retrieval, with no vector embeddings or hosted LLM APIs required. Users ask natural-language questions and receive grounded answers with citations to the source papers.

Overview

Most RAG systems rely on vector embeddings and similarity search. This project demonstrates that classical BM25 (Best Match 25), a keyword-based ranking function widely used in information retrieval, can provide strong retrieval on its own. The retrieved abstracts are passed as context to a locally running LLM, which synthesizes a cited answer.
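
To make the ranking step concrete, here is a minimal from-scratch sketch of Okapi BM25 scoring. It is an illustration only: the project itself uses the rank_bm25 package, whose tokenization and IDF variant may differ. The function names, the whitespace tokenizer, and the parameter defaults (k1=1.5, b=0.75) are assumptions made for this example, and `docs` is assumed to be non-empty.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive lowercase whitespace tokenizer (an assumption for illustration).
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document in `docs` against `query` with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Smoothed IDF that stays non-negative (the "+ 1" variant).
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Term frequency saturates via k1; b controls length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def top_k(query, docs, k=3):
    """Return (index, score) pairs for the k highest-scoring documents."""
    scores = bm25_scores(query, docs)
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in ranked[:k]]
```

Documents that share no terms with the query score exactly zero, which is why keyword-heavy questions tend to work best with this kind of retriever.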

Architecture

User Question
      ↓
BM25 Retriever (rank_bm25)
      ↓
Top-K Relevant Abstracts
      ↓
LLM Context Window (Ollama — qwen2.5)
      ↓
Grounded Answer + Citations
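
The LLM step of the pipeline above could be sketched as a single HTTP call to Ollama's local REST API. This is a hedged sketch rather than the code in rag/generator.py: it assumes Ollama is listening on its default port (11434) and uses the non-streaming form of the /api/generate endpoint. The helper names and the timeout value are hypothetical.

```python
import json
import urllib.request

# Default local Ollama endpoint (assumed; adjust if Ollama runs elsewhere).
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(prompt, model="qwen2.5"):
    """Build a POST request asking Ollama for one complete (non-streamed) reply."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, model="qwen2.5", timeout=120):
    """Send the prompt to Ollama and return the generated text."""
    request = build_generate_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    # With "stream": False the reply is a single JSON object with a "response" field.
    return body["response"]
```

Calling `generate(prompt)` requires a running Ollama server with the model pulled; `build_generate_request` can be inspected without one.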

Project Structure

arxiv-rag/
├── data/
│   └── fetch_papers.py     # Fetches ArXiv abstracts via API
├── rag/
│   ├── retriever.py        # BM25 index and retrieval logic
│   └── generator.py        # LLM answer synthesis with citations
├── ui/
│   └── app.py              # Streamlit chat interface
├── requirements.txt
└── README.md

Features

  • Vectorless retrieval: BM25 ranking with no embeddings or vector database
  • Grounded answers: the LLM is instructed to cite only the retrieved papers
  • Source transparency: every retrieved paper is shown with its title, authors, BM25 score, and an abstract preview
  • Adjustable top-k: a slider controls how many papers are retrieved per query
  • Fully local inference: the LLM runs on Ollama, so no hosted LLM API is required (the ArXiv API is used only to fetch the dataset)
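
One way to get grounded, citable answers is to number each retrieved paper in the prompt and ask the model to cite by number. The sketch below shows that idea under stated assumptions: the paper fields (`title`, `authors`, `abstract`) and the instruction wording are hypothetical, not the prompt actually used in rag/generator.py.

```python
def build_prompt(question, papers):
    """Assemble a prompt that numbers each retrieved paper so the model can cite it."""
    sources = []
    for i, paper in enumerate(papers, start=1):
        authors = ", ".join(paper["authors"])
        sources.append(f"[{i}] {paper['title']} ({authors})\n{paper['abstract']}")
    context = "\n\n".join(sources)
    return (
        "Answer the question using only the papers below. "
        "Cite papers by their bracketed number, and say so if the papers "
        "do not contain the answer.\n\n"
        f"Papers:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

Because the numbering is assigned at prompt-build time, the UI can map a citation such as [2] back to the second retrieved paper when displaying sources.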

Dataset

~500 ArXiv abstracts fetched via the ArXiv API across 8 ML/AI topic areas:

  • Machine Learning
  • Deep Learning
  • Natural Language Processing
  • Reinforcement Learning
  • Computer Vision
  • Large Language Models
  • Graph Neural Networks
  • Transformer Architecture
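
As a rough illustration of how abstracts for one topic might be pulled from the ArXiv API, the sketch below builds a query URL and parses the Atom feed the API returns. It is not the contents of data/fetch_papers.py: the function names, the `all:` search field, the per-topic result count, and the output field names are assumptions.

```python
import urllib.parse
import xml.etree.ElementTree as ET

ARXIV_API = "http://export.arxiv.org/api/query"
ATOM = {"atom": "http://www.w3.org/2005/Atom"}  # namespace used by the Atom feed


def build_query_url(topic, max_results=50):
    """Build an ArXiv API URL that searches all fields for the quoted topic."""
    params = {"search_query": f'all:"{topic}"', "start": 0, "max_results": max_results}
    return f"{ARXIV_API}?{urllib.parse.urlencode(params)}"


def _text(element, tag):
    # Collapse the line breaks and indentation the feed puts inside text fields.
    return " ".join(element.findtext(tag, default="", namespaces=ATOM).split())


def parse_feed(xml_text):
    """Turn an Atom response into a list of paper dictionaries."""
    root = ET.fromstring(xml_text)
    papers = []
    for entry in root.findall("atom:entry", ATOM):
        papers.append({
            "id": _text(entry, "atom:id"),
            "title": _text(entry, "atom:title"),
            "abstract": _text(entry, "atom:summary"),
            "authors": [_text(a, "atom:name") for a in entry.findall("atom:author", ATOM)],
        })
    return papers
```

Fetching the URL (for example with `urllib.request.urlopen`) and passing the decoded body to `parse_feed` yields one batch of papers per topic.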

Tech Stack

| Layer     | Technology                     |
|-----------|--------------------------------|
| Data      | ArXiv API                      |
| Retrieval | BM25 (rank_bm25)               |
| LLM       | qwen2.5 via Ollama (local)     |
| UI        | Streamlit                      |
| Language  | Python                         |

Setup & Run

Prerequisites

  • Python 3.8+
  • Ollama installed and running, with the qwen2.5 model pulled:

ollama pull qwen2.5

1. Install dependencies

pip install -r requirements.txt

2. Fetch papers

python data/fetch_papers.py

This fetches ~500 unique ArXiv abstracts and saves them to data/papers.json.
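
Because the same paper can match several topics, the fetch step needs some form of de-duplication before saving. A minimal sketch, assuming each paper is a dictionary with a unique `id` field (the schema of data/papers.json is not documented here, so the field name and helper names are hypothetical):

```python
import json
from pathlib import Path


def dedupe_papers(papers):
    """Keep the first occurrence of each paper id, preserving order."""
    seen = set()
    unique = []
    for paper in papers:
        if paper["id"] not in seen:
            seen.add(paper["id"])
            unique.append(paper)
    return unique


def save_papers(papers, path="data/papers.json"):
    """Write the papers as UTF-8 JSON, creating the parent folder if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(papers, indent=2, ensure_ascii=False), encoding="utf-8")


def load_papers(path="data/papers.json"):
    """Read the papers back as a list of dictionaries."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```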

3. Start the app

streamlit run ui/app.py

Open http://localhost:8501 in your browser.

Example Questions

  • "What methods are used for image segmentation?"
  • "How do transformers work in NLP?"
  • "What are the latest advances in reinforcement learning?"
  • "How are graph neural networks used in practice?"
  • "What are common techniques for training large language models?"
  • "What are the challenges of deploying machine learning models in production?"

Notes

  • data/papers.json is not committed; regenerate it with python data/fetch_papers.py
  • The BM25 index is built in memory at startup (~1 second for 500 papers)
  • Ollama must be running before you start the app (run ollama serve if it is not started automatically)
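
To check the last point before launching the app, a small script can ask Ollama which models are installed. This sketch assumes Ollama's default local address and its /api/tags endpoint, which lists pulled models. The helper names are hypothetical and not part of this repository.

```python
import json
import urllib.error
import urllib.request

# Default local Ollama endpoint for listing models (assumed).
OLLAMA_TAGS_URL = "http://localhost:11434/api/tags"


def has_model(tags, name="qwen2.5"):
    """Return True if a model called `name` (with any tag) is in an /api/tags response."""
    for model in tags.get("models", []):
        # Model names look like "qwen2.5:latest"; compare the part before the colon.
        if model.get("name", "").split(":")[0] == name:
            return True
    return False


def ollama_ready(name="qwen2.5", timeout=2):
    """Return True if Ollama answers locally and the model has been pulled."""
    try:
        with urllib.request.urlopen(OLLAMA_TAGS_URL, timeout=timeout) as response:
            tags = json.loads(response.read().decode("utf-8"))
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or an unexpected non-JSON reply.
        return False
    return has_model(tags, name)
```

If `ollama_ready()` returns False, starting the server with ollama serve or pulling the model with ollama pull qwen2.5 are the first things to try.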
