A production-grade Retrieval Augmented Generation (RAG) system with a multi-agent architecture, powered by Ollama for local LLM inference. This system features intelligent query parsing, semantic search, and multi-turn conversation support with a modern React + TypeScript frontend.
The Local MultiAgentic RAG System is designed to provide an enterprise-grade solution for building intelligent conversational applications that can access and reason over custom knowledge bases. It combines multiple specialized agents to break down complex queries, retrieve relevant information, and generate contextually accurate responses.
- Multi-Agent Architecture: Specialized agents for query parsing, refinement, and response generation
- Semantic Search: Vector-based retrieval using Chroma and Ollama embeddings
- Local Inference: Run entirely on your machine using Ollama - no external APIs required
- Modular Backend: FastAPI-based architecture with clean separation of concerns
- Modern Frontend: React + TypeScript with responsive, cyberpunk-themed UI
- SQLite Persistence: Store conversations and metadata locally
- Session Management: Multi-session support with full conversation history
- Production-Ready: Error handling, logging, and scalable architecture
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Frontend (React + TS) β
β ββββββββββββ¬βββββββββββββββ¬βββββββββββββββββββββββββββ β
β β Sidebar β Chat UI β Sources Panel β β
β ββββββββββββ΄βββββββββββββββ΄βββββββββββββββββββββββββββ β
ββββββββββββββββββ¬ββββββββββββββββββββββββββββββββββββββββββ
β HTTP/WebSocket (REST API)
ββββββββββββββββββ΄ββββββββββββββββββββββββββββββββββββββββββ
β FastAPI Backend (Python) β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β β API Routes Layer ββ
β β ββ Chat Endpoints ββ Knowledge Base ββ
β β ββ Session Management ββ File Upload ββ
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β β Multi-Agent RAG Pipeline ββ
β β ββββββββββββ ββββββββββββ ββββββββββββ ββ
β β β Query ββ β RAG ββ βResponse β ββ
β β β Parser β β Query β βGenerator β ββ
β β β Agent β β Agent β β Agent β ββ
β β ββββββββββββ ββββββββββββ ββββββββββββ ββ
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β β Data & Storage Layer ββ
β β ββ Vector DB (Chroma) ββ Chat DB (SQLite) ββ
β β ββ PDF Processing ββ Embeddings (Ollama) ββ
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
The system implements a sophisticated multi-agent verification pipeline:
- Purpose: Analyzes user queries in context of conversation history
- Functionality:
- Resolves pronouns and contextual references
- Splits compound questions into focused sub-queries
- Generates explicit, self-contained search queries
- Handles follow-up questions by enriching with prior context
Example:
User: "Tell me more about them"
Context: [Prior discussion about Indian laws]
β Resolved: "Indian laws detailed explanation"
- Purpose: Optimizes queries for vector database search
- Functionality:
- Refines parsed queries for semantic similarity
- Adds domain-specific keywords and context
- Ensures optimal retrieval from knowledge base
- Adapts to conversation context
Example:
Query: "What are fundamental rights?"
Context: [Discussion about Indian Constitution]
β Refined: "Fundamental rights Indian Constitution Article 12-35 explanation"
- Purpose: Generates accurate, context-aware responses
- Functionality:
- Synthesizes information from retrieved chunks
- Maintains consistency with conversation history
- Cites sources appropriately
- Falls back gracefully when information is unavailable
Verification mechanisms:
- Context relevance scoring
- Source attribution
- Factual consistency checking
- Conversation coherence validation
Local_MultiAgentic_RAG_System/
βββ backend/ # Backend FastAPI application
β βββ config/
β β βββ settings.py # Central configuration
β βββ core/
β β βββ rag_chat.py # RAG pipeline orchestration
β βββ agents/
β β βββ models.py # Model configurations
β β βββ query_parser.py # Query resolution agent
β β βββ rag_query_agent.py # Query refinement agent
β β βββ response_agent.py # Response generation agent
β βββ modules/
β β βββ embedding_function.py # Ollama embeddings
β β βββ pdf_loader.py # PDF document loading
β β βββ text_splitter.py # Document chunking
β β βββ vector_db.py # Chroma operations
β βββ database/
β β βββ chat_db.py # SQLite operations
β βββ api/
β β βββ routes/
β β βββ chat.py # Chat endpoints
β β βββ knowledge.py # Knowledge base endpoints
β β βββ __init__.py
β βββ main.py # Uvicorn, startup config (imported as main.py in root)
β
βββ frontend/ # React + TypeScript application
β βββ src/
β β βββ components/
β β β βββ Chat/ # Chat interface
β β β βββ Sidebar/ # Session management
β β β βββ SourcesPanel/ # Retrieved sources display
β β β βββ Common/ # Shared components
β β βββ pages/ # Page components
β β βββ services/
β β β βββ apiService.ts # API client
β β βββ hooks/
β β β βββ index.ts # Custom React hooks
β β βββ types/
β β β βββ index.ts # TypeScript types
β β βββ utils/
β β β βββ index.ts # Utility functions
β β βββ styles/
β β β βββ globals.css # Global styles
β β β βββ theme.ts # Theme configuration
β β βββ App.tsx # Main app component
β β βββ main.tsx # React entry point
β βββ public/ # Static assets
β βββ package.json
β βββ tsconfig.json
β βββ vite.config.ts
β βββ index.html
β
βββ knowledge_base/ # PDF documents folder
βββ chroma_db/ # Vector database storage
βββ main.py # FastAPI entry point
βββ requirements_backend.txt # Python dependencies
βββ requirements.txt # Legacy
βββ README.md # This file
βββ LICENSE
- Python 3.9+
- Node.js 18+
- Ollama (Download from ollama.ai)
- Install Python dependencies:
pip install -r requirements_backend.txt- Pull required Ollama models:
# LLM model for agents
ollama pull qwen2.5:3b
# Embedding model for semantic search
ollama pull bge-m3:latest- Verify Ollama is running (should be accessible at
http://localhost:11434)
- Navigate to frontend directory:
cd frontend- Install Node dependencies:
npm install- Create environment file:
cp .env.example .env
# Edit .env if needed (default points to localhost:8000)From project root:
python main.pyThe backend will start at http://localhost:8000
Available endpoints:
- API:
http://localhost:8000/api/ - Docs:
http://localhost:8000/docs
In a new terminal:
cd frontend
npm run devFrontend will be available at http://localhost:3000
-
Add Knowledge Base:
- Place PDF files in
./knowledge_base/directory - Or use the upload feature in the UI
- Frontend will automatically index new PDFs
- Place PDF files in
-
Chat Interface:
- Type questions in the input box
- Chat history is automatically saved in SQLite
- Sources are displayed alongside responses
-
Session Management:
- Create new chat sessions using the "+" button
- View all past sessions in the sidebar
- Delete sessions to clean up
Stores chat session metadata:
CREATE TABLE sessions (
id TEXT PRIMARY KEY,
title TEXT DEFAULT 'New Chat',
created_at TIMESTAMP,
updated_at TIMESTAMP,
metadata TEXT -- JSON metadata
);Stores all messages in sessions:
CREATE TABLE messages (
id INTEGER PRIMARY KEY,
session_id TEXT,
role TEXT, -- 'user' or 'assistant'
content TEXT,
sources TEXT, -- JSON array of sources
tokens_used INTEGER,
timestamp TIMESTAMP,
FOREIGN KEY (session_id) REFERENCES sessions(id)
);Tracks which chunks were used for each message:
CREATE TABLE chunk_references (
id INTEGER PRIMARY KEY,
message_id INTEGER,
chunk_id TEXT,
source_file TEXT,
page_number INTEGER,
relevance_score REAL,
FOREIGN KEY (message_id) REFERENCES messages(id)
);Stores additional session metadata:
CREATE TABLE conversation_metadata (
id INTEGER PRIMARY KEY,
session_id TEXT,
key TEXT,
value TEXT,
FOREIGN KEY (session_id) REFERENCES sessions(id)
);- Collection:
pdf_chunks - Embeddings: bge-m3 (384-dimensional vectors)
- Similarity Metric: Cosine similarity
- Chunk Size: 600 tokens
- Chunk Overlap: 200 tokens
- Threshold: 0.6 similarity score
- Model: qwen2.5:3b
- Input: User query + recent conversation context
- Output: List of explicit search queries
- Key Features:
- Pronoun resolution
- Context enrichment
- Query decomposition
- Model: qwen2.5:3b
- Input: Parsed query + conversation context
- Output: Optimized search query
- Key Features:
- Keyword extraction
- Domain adaptation
- Semantic optimization
- Model: qwen2.5:3b
- Input: Question + retrieved context + history
- Output: Natural language response
- Key Features:
- Context synthesis
- Source attribution
- Hallucination prevention
POST /api/chat/message
- Send a message and get a response
- Request: { message: string, session_id?: string }
- Response: { response: string, session_id: string, sources: [] }
WebSocket /api/chat/ws/{session_id}
- Streaming responses via WebSocket
- Message format: { message: string }
GET /api/chat/sessions
- Get all sessions
GET /api/chat/session/{session_id}
- Get specific session with messages
POST /api/chat/session/create
- Create new session
DELETE /api/chat/session/{session_id}
- Delete session
GET /api/knowledge/stats
- Get knowledge base statistics
GET /api/knowledge/structure
- Get KB structure (files β pages β chunks)
GET /api/knowledge/chunks
- Get all indexed chunks
POST /api/knowledge/refresh
- Refresh/reindex knowledge base
GET /api/knowledge/search?query=...&k=5
- Search knowledge base
- Chat: Main conversation interface with auto-scroll
- Sidebar: Session management with quick access
- SourcesPanel: Display retrieved sources with scores
- Common: Reusable UI components
useChat(): Chat state managementuseSessions(): Session managementuseKnowledgeBase(): Knowledge base operationsuseWebSocket(): WebSocket connection handling
- Cyberpunk-inspired dark theme
- Responsive design (mobile, tablet, desktop)
- Smooth animations and transitions
- Accessibility-focused
- CORS configured for localhost only (configure for production)
- No external API calls - completely local
- PDFs processed locally without transmission
- SQLite database is local and encrypted via filesystem permissions
- Implement authentication layer for production deployment
- Chunking Strategy: 600 tokens with 200-token overlap optimizes balance between context and retrieval
- Similarity Threshold: 0.6 ensures relevant results while maintaining precision
- Context Window: Last 3 conversational turns (6 messages) for agent context
- Embedding Model: bge-m3 provides high-quality semantic representations
- Create new agent file in
backend/agents/ - Implement agent logic using Ollama chat API
- Register in
backend/core/rag_chat.py - Update API routes if needed
- Modify component files in
frontend/src/components/ - Update styles in component
.cssfiles - Extend types in
frontend/src/types/index.ts - Add hooks in
frontend/src/hooks/index.ts
- Add PDFs to
knowledge_base/directory - Use upload endpoint to add files programmatically
- System automatically detects and indexes new PDFs
- Use
/api/knowledge/refreshto reprocess all files
- FastAPI: Modern async web framework
- Uvicorn: ASGI server
- LangChain: LLM and vector store abstractions
- Chroma: Vector database
- Ollama: Local LLM inference
- SQLite3: Persistent message storage
- React 18: UI framework
- TypeScript: Type-safe JavaScript
- Axios: HTTP client
- Vite: Build tool and dev server
- Marked: Markdown rendering
- Lucide React: Icon library
# Verify Ollama is running
ollama list
# Re-pull required models
ollama pull qwen2.5:3b
ollama pull bge-m3:latest- Ensure port 8000 is available
- Check CORS settings if frontend can't reach backend
- Verify Ollama is running on port 11434
cd frontend
rm -rf node_modules package-lock.json
npm install
npm run buildThis project is licensed under the MIT License - see LICENSE file for details.
For issues, questions, or contributions:
- Check existing documentation
- Review the project structure and code comments
- Test locally before submitting changes
- Follow the existing code style and patterns
- User authentication and authorization
- Fine-tuning support for domain-specific models
- Advanced query expansion and synonymy handling
- Document versioning and management
- Real-time collaboration features
- Advanced analytics and insights
- Multi-language support
- GPU optimization for faster inference
- API rate limiting and usage tracking
- Backup and disaster recovery
Last Updated: June 2026
Version: 1.0.0
Status: Production-Ready