saadamir1/multimodal-rag-system

Multimodal RAG System

A comprehensive Retrieval-Augmented Generation (RAG) system that processes heterogeneous documents containing text, tables, and images to provide intelligent question-answering capabilities.

🌟 Features

  • Multimodal Document Processing: Handles PDFs with text, tables, and images
  • Advanced Embeddings:
    • CLIP for image embeddings
    • MiniLM for text embeddings
  • Multiple Reasoning Strategies:
    • Chain of Thought (CoT)
    • Few-shot prompting
    • Zero-shot prompting
  • Vector Storage: ChromaDB for efficient similarity search
  • Generation Model: TinyLlama for response generation
  • Comprehensive Evaluation: BLEU, ROUGE, precision, recall metrics

🏗️ Architecture

Documents (PDF) → Content Extraction → Embedding Generation → Vector Storage → Query Processing → LLM Generation
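
The pipeline above is plain function composition. The sketch below is illustrative only — the function names (`extract_content`, `embed`, `store`, `answer`) are assumptions, not the repository's actual API:

```python
# Illustrative pipeline skeleton; stage names are hypothetical and do
# not reflect the repository's actual API.

def extract_content(pdf_path):
    # Stand-in for text/table/image extraction from the PDF.
    return [f"chunk from {pdf_path}"]

def embed(chunks):
    # Stand-in for MiniLM / CLIP embedding generation.
    return [[float(len(c))] for c in chunks]

def store(vectors, chunks):
    # Stand-in for the ChromaDB collection.
    return {"vectors": vectors, "chunks": chunks}

def answer(db, query):
    # Stand-in for retrieval plus LLM generation.
    return f"answered {query!r} over {len(db['chunks'])} chunks"

chunks = extract_content("document1.pdf")
db = store(embed(chunks), chunks)
print(answer(db, "What was the company revenue in 2023?"))
```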

📊 Performance Metrics

  • Document Processing: 569 chunks extracted from 3 heterogeneous documents
  • Retrieval Quality: High precision with vector similarity search
  • Generation Performance:
    • CoT: More comprehensive answers (14.91s latency)
    • Zero-shot: Higher ROUGE scores (9.05s latency)
    • Few-shot: Balanced performance

🚀 Quick Start

Prerequisites

pip install torch torchvision
pip install sentence-transformers transformers
pip install chromadb
pip install PyMuPDF pillow pytesseract
pip install rouge-score nltk
pip install matplotlib scikit-learn

Usage

from rag_system import RAGSystem

# Initialize the system
rag = RAGSystem()

# Process documents
pdf_files = ["document1.pdf", "document2.pdf", "document3.pdf"]
rag.ingest_documents(pdf_files)

# Query with different strategies
result = rag.query(
    text_query="What was the company revenue in 2023?",
    strategy="cot",  # or "few_shot", "zero_shot"
    top_k=5
)

print(f"Answer: {result['answer']}")
print(f"Sources: {result['sources']}")

📈 Evaluation Results

Retrieval Performance

  • Precision@K: varies with query complexity
  • Recall@K: tuned toward retrieving all relevant chunks
  • Mean Average Precision: consistent across document types

Generation Quality by Strategy

  • Chain of Thought: Best for complex reasoning tasks
  • Zero-Shot: Highest lexical overlap (ROUGE scores)
  • Few-Shot: Balanced approach for factual queries

System Performance

  • Average processing time: ~12s per query
  • Embedding generation: 384-dimensional vectors
  • Vector database: Cosine similarity search

🔧 Components

Document Processing (PDFProcessor)

  • Text extraction from PDF pages
  • Table detection and extraction
  • Image extraction with OCR using Tesseract
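
One way to represent the mixed content these steps produce is a typed chunk record. The `Chunk` class and its fields below are hypothetical, not the repository's actual schema:

```python
from dataclasses import dataclass

# Hypothetical chunk record; field names are assumptions, not the
# repository's actual schema.
@dataclass
class Chunk:
    content: str
    content_type: str   # "text", "table", or "image_ocr"
    source: str         # e.g. "financials.pdf"
    page: int

def route_chunk(raw, source, page):
    # Crude heuristic for demonstration only: tab-separated rows are
    # treated as tables, everything else as plain text.
    kind = "table" if "\t" in raw else "text"
    return Chunk(content=raw, content_type=kind, source=source, page=page)
```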

Embedding Generation (EmbeddingGenerator)

  • Text embeddings: all-MiniLM-L6-v2
  • Image embeddings: openai/clip-vit-base-patch32
  • Image embeddings projected into the same 384-dimensional space as the text embeddings
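
CLIP ViT-B/32 natively emits 512-dimensional image embeddings, while all-MiniLM-L6-v2 emits 384-dimensional text embeddings, so some mapping is needed to search them in one space. The random linear projection below is purely illustrative; the repository may use a different mechanism:

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative fixed projection from CLIP's 512-d image space down to
# MiniLM's 384-d text space. A real system might learn this mapping;
# the random matrix here only demonstrates the shape change.
PROJ = rng.standard_normal((512, 384)) / np.sqrt(512)

def project_image_embedding(clip_vec):
    v = clip_vec @ PROJ
    return v / np.linalg.norm(v)   # unit-normalise for cosine search

img_vec = rng.standard_normal(512)
print(project_image_embedding(img_vec).shape)  # (384,)
```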

Vector Database (VectorDB)

  • ChromaDB persistent storage
  • Cosine similarity search
  • Metadata tagging for source tracking
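
The repository uses ChromaDB for this; the pure-Python stand-in below (class name and methods are illustrative, not ChromaDB's API) just shows the core operation such a store performs: cosine-similarity top-k retrieval with metadata attached to each hit.

```python
import math

# Minimal in-memory stand-in for the ChromaDB collection; it only
# demonstrates cosine-similarity top-k retrieval with source metadata.
class TinyVectorDB:
    def __init__(self):
        self.items = []  # list of (vector, text, metadata) tuples

    def add(self, vector, text, metadata):
        self.items.append((vector, text, metadata))

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)

    def query(self, vector, top_k=5):
        scored = [(self._cosine(vector, v), t, m) for v, t, m in self.items]
        scored.sort(key=lambda s: s[0], reverse=True)
        return scored[:top_k]
```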

Language Model (LLMProcessor)

  • Model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
  • Multiple prompting strategies
  • Automatic evaluation metrics
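
The three strategies differ mainly in how the prompt is built around the retrieved context. The templates below are illustrative; the repository's exact wording may differ:

```python
# Illustrative prompt templates for the three strategies; the
# repository's actual prompt wording may differ.
def build_prompt(strategy, context, question):
    if strategy == "zero_shot":
        return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    if strategy == "few_shot":
        # One worked example steers the answer format (hypothetical).
        example = ("Question: What was Q1 profit?\n"
                   "Answer: Q1 profit was $2.1M.\n\n")
        return f"Context:\n{context}\n\n{example}Question: {question}\nAnswer:"
    if strategy == "cot":
        return (f"Context:\n{context}\n\nQuestion: {question}\n"
                "Let's reason step by step before answering:")
    raise ValueError(f"unknown strategy: {strategy}")
```

The longer CoT prompt asks the model to produce intermediate reasoning, which is consistent with CoT's higher latency reported above.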

📋 Supported Document Types

  • Text: Paragraphs, headings, bullet points
  • Tables: Financial data, structured information
  • Images: Charts, diagrams, photos (with OCR)

🎯 Use Cases

  • Financial report analysis
  • Academic paper review
  • Technical documentation Q&A
  • Multi-format document search
  • Research paper summarization

🔍 Advanced Features

Embedding Visualization

  • t-SNE visualization of embedding space
  • Document type clustering analysis
  • Interactive exploration of vector relationships

Evaluation Metrics

  • BLEU scores for generation quality
  • ROUGE scores for content overlap
  • Precision/Recall for retrieval accuracy
  • Latency analysis across strategies
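
BLEU and ROUGE come from the nltk and rouge-score packages; the retrieval metrics are simple enough to show directly. These helper names are illustrative, not the repository's own functions:

```python
# Precision@k and Recall@k for retrieval evaluation. `retrieved` is the
# ranked list of chunk ids returned by the store; `relevant` is the
# ground-truth set of chunk ids for the query.
def precision_at_k(retrieved, relevant, k):
    top = retrieved[:k]
    return sum(1 for doc in top if doc in relevant) / k

def recall_at_k(retrieved, relevant, k):
    if not relevant:
        return 0.0
    top = retrieved[:k]
    return sum(1 for doc in top if doc in relevant) / len(relevant)
```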

📊 Sample Results

Query: "What was the company revenue in 2023?"
Strategy: Chain of Thought
Answer: Based on the financial documents, the company revenue in 2023 was $10.5 million, representing a 15% growth from the previous year.
Sources: financials.pdf (page 3), annual_report.pdf (page 12)
Latency: 14.2s

🛠️ Configuration

The system can be configured for different use cases:

  • Model Selection: Switch between different embedding models
  • Chunk Size: Adjust text segmentation parameters
  • Retrieval Count: Modify top-k results
  • Generation Parameters: Temperature, max tokens, beam search
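
A sketch of how these knobs might be grouped into one settings object; the class and field names are hypothetical, not the repository's actual configuration:

```python
from dataclasses import dataclass

# Hypothetical configuration object grouping the knobs above; names
# and defaults are illustrative, not the repository's actual settings.
@dataclass
class RAGConfig:
    text_model: str = "all-MiniLM-L6-v2"
    image_model: str = "openai/clip-vit-base-patch32"
    chunk_size: int = 512        # characters per text chunk
    top_k: int = 5               # retrieved chunks per query
    temperature: float = 0.7
    max_new_tokens: int = 256

# Override only what a given use case needs.
config = RAGConfig(top_k=3, temperature=0.2)
```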

📝 Citation

If you use this work in your research, please cite:

@misc{multimodal-rag-2024,
  title={Multimodal Retrieval-Augmented Generation System},
  author={Saad Amir},
  year={2024},
  howpublished={\url{https://github.com/yourusername/multimodal-rag-system}}
}

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • OpenAI CLIP for multimodal embeddings
  • Sentence Transformers for text embeddings
  • ChromaDB for vector storage
  • TinyLlama for efficient generation
